Reproducible Stata analysis and reporting

Authors

Bongani Ncube

University of the Witwatersrand (Biostatistician)

Published

28 February 2025

Introduction

This report demonstrates reproducible research using Stata. It was compiled using Quarto and RStudio.

Stata Layout

Results Window

The big central section is the Results window, where you’ll see the results of the commands you run. Under it is the Command window, where you’ll type commands when you’re working interactively.

History Window

On the left is the History window, which contains a history of the commands you’ve run. Click once on a command to paste it back into the Command window for editing. Double-click on a command to run it again. You can also press Page Up when you’re in the Command window to recall past commands. Right-click on a command or block of commands to copy it into the clipboard or send it to the Do File editor. This allows you to take something you’ve done interactively and turn it into part of a do file.

Working Directory

Beneath the History window Stata displays the working directory. This is where Stata will save files if you don’t specify another location.
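
The working directory can also be inspected and used from the command line. The following is a minimal sketch, assuming the auto system dataset is available; auto_copy is a hypothetical file name chosen for illustration, and the cd line is commented out because its path is only a placeholder.

```stata
* Show the current working directory
pwd

* The same path is available as the c-class value c(pwd)
display c(pwd)

* To change the working directory, edit and uncomment the next line
* (the path shown is a hypothetical placeholder)
* cd "C:/projects/stata-report"

* Saving without a path writes the file to the working directory
sysuse auto, clear
save auto_copy, replace      // auto_copy is a hypothetical file name
confirm file "auto_copy.dta" // errors out if the file is not there
```

Because relative file names resolve against the working directory, setting it once at the top of a do file is one way to keep the rest of the file free of machine-specific paths.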

Variables Window

On the right is the Variables window, which contains a list of the variables in the current data set. Click once on a variable name to select it, and information about the variable will be shown in the Properties window on the bottom right. Click twice, and the variable name will be pasted into the Command window. You can also start typing a variable name in the Command window and press Tab, and Stata will either complete the variable name or give you a list of variables that match what you’ve typed so far.
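
Much of what these windows show can also be requested with commands, which is handy in a do file. The sketch below assumes the auto system dataset; describe reports a variable's storage type, display format, value label, and variable label, and a wildcard plays a role similar to Tab completion by matching every variable that starts with the letters typed.

```stata
sysuse auto, clear

* Storage type, display format, value label, and variable label
describe make price foreign

* A wildcard matches all variables beginning with m (here: make and mpg)
describe m*

* ds lists matching variable names and stores them in r(varlist)
ds m*
display "`r(varlist)'"
```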

Properties Window

The Properties window also has a section for properties of the data set. One to keep an eye on is the size, or how much memory it requires. Stata must load your entire data set into memory. Modern computers have so much memory that most Stata users never have to worry about it, but big data users must make sure they don’t run out of memory. If you try to use more memory than your computer has, the operating system will use disk space as memory and Stata will become so slow that it’s practically unusable.
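
The size of the data set in memory can be checked from the command line as well. This is a minimal sketch, assuming the auto system dataset and a reasonably recent Stata (the memory command was introduced in Stata 12); the byte count computed at the end is an approximation based on the results stored by describe.

```stata
sysuse auto, clear

* Number of observations and variables
describe, short

* Approximate data size in bytes: observations times bytes per observation
display r(N) * r(width)

* Detailed report of how Stata is using memory
memory

* Store each variable in the smallest type that loses no information
compress
```

For large data sets, running compress before saving is a cheap way to reduce both file size and memory use.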

If a command is running, the button on the far right of the top toolbar will turn stop-sign red. Clicking it will tell Stata to stop what it’s doing, though it may take some time to notice. Pressing q will do the same thing.

Structure of a Stata Data Set

Let’s begin by opening a Stata system dataset named auto and examining its structure:

sysuse auto, clear
(1978 automobile data)

make           price  mpg  rep78  headroom  trunk  weight  length  turn  displacement  gear_ratio  foreign
AMC Concord     4099   22      3       2.5     11    2930     186    40           121        3.58        0
AMC Pacer       4749   17      3       3.0     11    3350     173    40           258        2.53        0
AMC Spirit      3799   22     NA       3.0     12    2640     168    35           121        3.08        0
Buick Century   4816   20      3       4.5     16    3250     196    40           196        2.93        0
Buick Electra   7827   15      4       4.0     20    4080     222    43           350        2.41        0
Buick LeSabre   5788   18      3       4.0     21    3670     218    43           231        2.73        0
Buick Opel      4453   26     NA       3.0     10    2230     170    34           304        2.87        0
Buick Regal     5189   20      3       2.0     16    3280     200    42           196        2.93        0
Buick Riviera  10372   16      3       3.5     17    3880     207    43           231        2.93        0
Buick Skylark   4082   19      3       3.5     13    3400     200    42           231        3.08        0

Typing browse opens Stata’s Data Editor, which shows your data set in a spreadsheet-like form, in browse mode. You can also invoke the Data Editor in edit mode by typing edit or clicking the button that looks like a pencil writing in a spreadsheet; it will then allow you to make changes. You might use edit mode for data entry, but since you should never change your data interactively, get in the habit of using browse mode so you don’t make changes by accident.

Observations and Variables

A Stata data set is a matrix, with one row for each observation and one column for each variable. This raises the question “What is an observation in this data set?” The values of the make variable suggest they are cars, but are they individual cars or kinds of cars? The fact that there is just one row for each value of make suggests kinds of cars.
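
The claim that each value of make appears in exactly one row can be checked rather than eyeballed. The following sketch, assuming the auto system dataset, uses isid, which stops with an error if the named variable does not uniquely identify the observations, and duplicates report, which tabulates how many copies of each value exist.

```stata
sysuse auto, clear

* Number of observations (rows)
count

* Runs silently if make uniquely identifies the rows, errors out otherwise
isid make

* Tabulates how many copies exist of each value of make
duplicates report make
```

If isid runs without complaint, make can serve as an identifier, which supports reading each row as one kind of car.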

Manually Inputting Data

input str12 name
  Ringo         
  John          
  Paul          
  George        
end  
             name
  1.   Ringo         
  2.   John          
  3.   Paul          
  4.   George        
  5. end  

The first line tells Stata that we are going to input data for a string variable called name. The number 12 tells input that we want the string variable to allow up to 12 characters for each observation. The next four lines are the raw data, which include the names Ringo, John, Paul, and George. The word end in the last line tells Stata that we are finished adding data.
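
The same pattern extends to more than one variable: each variable name can be preceded by its storage type, and each data line supplies one value per variable. In the sketch below, born is a hypothetical variable added for illustration (birth years). Quotes around string values are optional here, but they would be required for any value that contains a space.

```stata
clear

* str12: string of up to 12 characters; int: integer storage type
* born is a hypothetical variable added for illustration
input str12 name int born
"Ringo"  1940
"John"   1940
"Paul"   1942
"George" 1943
end

list
```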

Elements of Stata Syntax

Almost all Stata commands use a standard syntax. This syntax allows you to control what part of the data set the command acts on, modify what the command does, and more.

[by varlist]: command [varlist] [=exp] [if exp] [in range] [weight] [using filename][, options]

We’ll discuss five syntax elements:

  • Commands
  • Variable Lists
  • If Conditions
  • Options
  • By Groups
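
To see how these five elements fit together, here is one command that uses all of them. It is an illustrative sketch, assuming the auto system dataset; bysort is used in place of by so that the data are sorted by the grouping variable first, which by requires.

```stata
sysuse auto, clear

* by group:       foreign
* command:        summarize
* variable list:  price mpg
* if condition:   keep only cars with a non-missing repair record
* option:         detail (adds percentiles, skewness, kurtosis)
bysort foreign: summarize price mpg if !missing(rep78), detail
```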

Stata Commands

Stata is a command-based language. Most Stata commands are verbs. They tell Stata to do something: summarize, tabulate, regress, etc. Normally the command itself comes first and then you tell Stata the details of what you want it to do after.

Many commands can be abbreviated: sum instead of summarize, tab instead of tabulate, reg instead of regress. Commands that can destroy data, like replace, cannot be abbreviated.
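
As a quick illustration, assuming the auto system dataset, each pair of lines below runs the same command twice, once spelled out and once abbreviated, and produces identical output.

```stata
sysuse auto, clear

summarize price
sum price            // same command, abbreviated

tabulate foreign
tab foreign          // same command, abbreviated

regress price mpg
reg price mpg        // same command, abbreviated
```

Abbreviations save typing in interactive work; whether to use them in do files is a matter of taste, though full command names are arguably easier for new readers to look up.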

We’ll explore the elements of Stata syntax using list, a command that makes it easy to see what they do and works well in a web book. It lists your data set in the Results window. (I apologize for all the scrolling you’ll need to do in this chapter!)

list
     +----------------------------------------------------------------------+
  1. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | AMC Concord       |  4,099 |  22 |     3 |      2.5 |    11 |  2,930 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      186   |     40   |        121   |       3.58    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  2. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | AMC Pacer         |  4,749 |  17 |     3 |      3.0 |    11 |  3,350 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      173   |     40   |        258   |       2.53    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  3. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | AMC Spirit        |  3,799 |  22 |     . |      3.0 |    12 |  2,640 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      168   |     35   |        121   |       3.08    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  4. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Buick Century     |  4,816 |  20 |     3 |      4.5 |    16 |  3,250 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      196   |     40   |        196   |       2.93    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  5. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Buick Electra     |  7,827 |  15 |     4 |      4.0 |    20 |  4,080 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      222   |     43   |        350   |       2.41    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  6. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Buick LeSabre     |  5,788 |  18 |     3 |      4.0 |    21 |  3,670 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      218   |     43   |        231   |       2.73    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  7. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Buick Opel        |  4,453 |  26 |     . |      3.0 |    10 |  2,230 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      170   |     34   |        304   |       2.87    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  8. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Buick Regal       |  5,189 |  20 |     3 |      2.0 |    16 |  3,280 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      200   |     42   |        196   |       2.93    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
  9. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Buick Riviera     | 10,372 |  16 |     3 |      3.5 |    17 |  3,880 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      207   |     43   |        231   |       2.93    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 10. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Buick Skylark     |  4,082 |  19 |     3 |      3.5 |    13 |  3,400 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      200   |     42   |        231   |       3.08    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 11. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Cad. Deville      | 11,385 |  14 |     3 |      4.0 |    20 |  4,330 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      221   |     44   |        425   |       2.28    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 12. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Cad. Eldorado     | 14,500 |  14 |     2 |      3.5 |    16 |  3,900 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      204   |     43   |        350   |       2.19    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 13. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Cad. Seville      | 15,906 |  21 |     3 |      3.0 |    13 |  4,290 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      204   |     45   |        350   |       2.24    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 14. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Chev. Chevette    |  3,299 |  29 |     3 |      2.5 |     9 |  2,110 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      163   |     34   |        231   |       2.93    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 15. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Chev. Impala      |  5,705 |  16 |     4 |      4.0 |    20 |  3,690 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      212   |     43   |        250   |       2.56    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 16. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Chev. Malibu      |  4,504 |  22 |     3 |      3.5 |    17 |  3,180 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      193   |     31   |        200   |       2.73    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 17. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Chev. Monte Carlo |  5,104 |  22 |     2 |      2.0 |    16 |  3,220 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      200   |     41   |        200   |       2.73    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 18. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Chev. Monza       |  3,667 |  24 |     2 |      2.0 |     7 |  2,750 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      179   |     40   |        151   |       2.73    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 19. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Chev. Nova        |  3,955 |  19 |     3 |      3.5 |    13 |  3,430 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      197   |     43   |        250   |       2.56    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 20. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Dodge Colt        |  3,984 |  30 |     5 |      2.0 |     8 |  2,120 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      163   |     35   |         98   |       3.54    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 21. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Dodge Diplomat    |  4,010 |  18 |     2 |      4.0 |    17 |  3,600 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      206   |     46   |        318   |       2.47    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 22. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Dodge Magnum      |  5,886 |  16 |     2 |      4.0 |    17 |  3,600 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      206   |     46   |        318   |       2.47    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 23. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Dodge St. Regis   |  6,342 |  17 |     2 |      4.5 |    21 |  3,740 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      220   |     46   |        225   |       2.94    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 24. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Ford Fiesta       |  4,389 |  28 |     4 |      1.5 |     9 |  1,800 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      147   |     33   |         98   |       3.15    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 25. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Ford Mustang      |  4,187 |  21 |     3 |      2.0 |    10 |  2,650 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      179   |     43   |        140   |       3.08    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 26. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Linc. Continental | 11,497 |  12 |     3 |      3.5 |    22 |  4,840 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      233   |     51   |        400   |       2.47    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 27. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Linc. Mark V      | 13,594 |  12 |     3 |      2.5 |    18 |  4,720 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      230   |     48   |        400   |       2.47    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 28. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Linc. Versailles  | 13,466 |  14 |     3 |      3.5 |    15 |  3,830 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      201   |     41   |        302   |       2.47    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 29. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Merc. Bobcat      |  3,829 |  22 |     4 |      3.0 |     9 |  2,580 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      169   |     39   |        140   |       2.73    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 30. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Merc. Cougar      |  5,379 |  14 |     4 |      3.5 |    16 |  4,060 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      221   |     48   |        302   |       2.75    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 31. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Merc. Marquis     |  6,165 |  15 |     3 |      3.5 |    23 |  3,720 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      212   |     44   |        302   |       2.26    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 32. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Merc. Monarch     |  4,516 |  18 |     3 |      3.0 |    15 |  3,370 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      198   |     41   |        250   |       2.43    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 33. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Merc. XR-7        |  6,303 |  14 |     4 |      3.0 |    16 |  4,130 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      217   |     45   |        302   |       2.75    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 34. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Merc. Zephyr      |  3,291 |  20 |     3 |      3.5 |    17 |  2,830 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      195   |     43   |        140   |       3.08    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 35. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Olds 98           |  8,814 |  21 |     4 |      4.0 |    20 |  4,060 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      220   |     43   |        350   |       2.41    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 36. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Olds Cutl Supr    |  5,172 |  19 |     3 |      2.0 |    16 |  3,310 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      198   |     42   |        231   |       2.93    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 37. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Olds Cutlass      |  4,733 |  19 |     3 |      4.5 |    16 |  3,300 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      198   |     42   |        231   |       2.93    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 38. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Olds Delta 88     |  4,890 |  18 |     4 |      4.0 |    20 |  3,690 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      218   |     42   |        231   |       2.73    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 39. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Olds Omega        |  4,181 |  19 |     3 |      4.5 |    14 |  3,370 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      200   |     43   |        231   |       3.08    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 40. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Olds Starfire     |  4,195 |  24 |     1 |      2.0 |    10 |  2,730 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      180   |     40   |        151   |       2.73    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 41. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Olds Toronado     | 10,371 |  16 |     3 |      3.5 |    17 |  4,030 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      206   |     43   |        350   |       2.41    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 42. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Plym. Arrow       |  4,647 |  28 |     3 |      2.0 |    11 |  3,260 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      170   |     37   |        156   |       3.05    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 43. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Plym. Champ       |  4,425 |  34 |     5 |      2.5 |    11 |  1,800 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      157   |     37   |         86   |       2.97    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 44. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Plym. Horizon     |  4,482 |  25 |     3 |      4.0 |    17 |  2,200 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      165   |     36   |        105   |       3.37    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 45. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Plym. Sapporo     |  6,486 |  26 |     . |      1.5 |     8 |  2,520 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      182   |     38   |        119   |       3.54    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 46. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Plym. Volare      |  4,060 |  18 |     2 |      5.0 |    16 |  3,330 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      201   |     44   |        225   |       3.23    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 47. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Pont. Catalina    |  5,798 |  18 |     4 |      4.0 |    20 |  3,700 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      214   |     42   |        231   |       2.73    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 48. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Pont. Firebird    |  4,934 |  18 |     1 |      1.5 |     7 |  3,470 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      198   |     42   |        231   |       3.08    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 49. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Pont. Grand Prix  |  5,222 |  19 |     3 |      2.0 |    16 |  3,210 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      201   |     45   |        231   |       2.93    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 50. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Pont. Le Mans     |  4,723 |  19 |     3 |      3.5 |    17 |  3,200 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      199   |     40   |        231   |       2.93    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 51. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Pont. Phoenix     |  4,424 |  19 |     . |      3.5 |    13 |  3,420 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      203   |     43   |        231   |       3.08    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 52. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Pont. Sunbird     |  4,172 |  24 |     2 |      2.0 |     7 |  2,690 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      179   |     41   |        151   |       2.73    |   Domestic    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 53. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Audi 5000         |  9,690 |  17 |     5 |      3.0 |    15 |  2,830 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      189   |     37   |        131   |       3.20    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 54. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Audi Fox          |  6,295 |  23 |     3 |      2.5 |    11 |  2,070 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      174   |     36   |         97   |       3.70    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 55. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | BMW 320i          |  9,735 |  25 |     4 |      2.5 |    12 |  2,650 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      177   |     34   |        121   |       3.64    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 56. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Datsun 200        |  6,229 |  23 |     4 |      1.5 |     6 |  2,370 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      170   |     35   |        119   |       3.89    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 57. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Datsun 210        |  4,589 |  35 |     5 |      2.0 |     8 |  2,020 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      165   |     32   |         85   |       3.70    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 58. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Datsun 510        |  5,079 |  24 |     4 |      2.5 |     8 |  2,280 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      170   |     34   |        119   |       3.54    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 59. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Datsun 810        |  8,129 |  21 |     4 |      2.5 |     8 |  2,750 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      184   |     38   |        146   |       3.55    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 60. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Fiat Strada       |  4,296 |  21 |     3 |      2.5 |    16 |  2,130 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      161   |     36   |        105   |       3.37    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 61. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Honda Accord      |  5,799 |  25 |     5 |      3.0 |    10 |  2,240 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      172   |     36   |        107   |       3.05    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 62. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Honda Civic       |  4,499 |  28 |     4 |      2.5 |     5 |  1,760 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      149   |     34   |         91   |       3.30    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 63. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Mazda GLC         |  3,995 |  30 |     4 |      3.5 |    11 |  1,980 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      154   |     33   |         86   |       3.73    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 64. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Peugeot 604       | 12,990 |  14 |     . |      3.5 |    14 |  3,420 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      192   |     38   |        163   |       3.58    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 65. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Renault Le Car    |  3,895 |  26 |     3 |      3.0 |    10 |  1,830 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      142   |     34   |         79   |       3.72    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 66. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Subaru            |  3,798 |  35 |     5 |      2.5 |    11 |  2,050 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      164   |     36   |         97   |       3.81    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 67. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Toyota Celica     |  5,899 |  18 |     5 |      2.5 |    14 |  2,410 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      174   |     36   |        134   |       3.06    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 68. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Toyota Corolla    |  3,748 |  31 |     5 |      3.0 |     9 |  2,200 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      165   |     35   |         97   |       3.21    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 69. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Toyota Corona     |  5,719 |  18 |     5 |      2.0 |    11 |  2,670 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      175   |     36   |        134   |       3.05    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 70. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | VW Dasher         |  7,140 |  23 |     4 |      2.5 |    12 |  2,160 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      172   |     36   |         97   |       3.74    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 71. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | VW Diesel         |  5,397 |  41 |     5 |      3.0 |    15 |  2,040 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      155   |     35   |         90   |       3.78    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 72. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | VW Rabbit         |  4,697 |  25 |     4 |      3.0 |    15 |  1,930 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      155   |     35   |         89   |       3.78    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 73. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | VW Scirocco       |  6,850 |  25 |     4 |      2.0 |    16 |  1,990 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      156   |     36   |         97   |       3.78    |    Foreign    |
     +----------------------------------------------------------------------+

     +----------------------------------------------------------------------+
 74. | make              |  price | mpg | rep78 | headroom | trunk | weight |
     | Volvo 260         | 11,995 |  17 |     5 |      2.5 |    14 |  3,170 |
     |----------------------------------------------------------------------|
     |   length   |   turn   |   displa~t   |   gear_r~o    |    foreign    |
     |      193   |     37   |        163   |       2.98    |    Foreign    |
     +----------------------------------------------------------------------+

Variable Lists

Listing one or more variables after a command tells the command it should only act on the variables listed:
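As a minimal sketch of a variable list, assuming the data are Stata's bundled auto data set (loaded here with sysuse, which is an assumption about how the data were opened), the following lists three of the variables shown in the output above:

```stata
* Assumption: the data are the auto data set that ships with Stata.
* clear discards whatever data are currently in memory.
sysuse auto, clear

* Act only on make, price, and mpg instead of every variable.
list make price mpg
```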

If conditions

An if condition tells a command which observations it should act on. It will only act on those observations where the condition is true. This allows you to do things with subsets of the data. An if condition comes after a variable list:
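One possible example, assuming the bundled auto data (the sysuse line is an assumption about how the data were loaded), restricts list to the foreign cars:

```stata
* Assumption: the data are the auto data set that ships with Stata.
sysuse auto, clear

* List make only for observations where foreign equals 1.
* The if condition follows the variable list.
list make if foreign==1
```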

Note the two equals signs! In Stata you use one equals sign when you’re setting something equal to something else (see Creating and Changing Variables) and two equals signs when you’re asking if two things are equal. Other operators you can use are > (greater than), < (less than), >= (greater than or equal to), <= (less than or equal to), and != (not equal to).

! all by itself means “not” and reverses whatever condition follows it.
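A few illustrative uses of these operators, again assuming the bundled auto data; the cutoff values are arbitrary choices for the sketch:

```stata
* Assumption: the data are the auto data set that ships with Stata.
sysuse auto, clear

* Greater than or equal to: cars that get at least 30 mpg.
list make mpg if mpg >= 30

* Not equal: cars whose value of foreign is not 1.
list make if foreign != 1

* ! on its own reverses the condition in parentheses:
* cars that are not priced above 4000.
list make price if !(price > 4000)
```

The last condition selects the same cars as price <= 4000; the parentheses make it clear what ! applies to.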

Internally, Stata equates true and false with one and zero. That means you can write:

list make if foreign

     +----------------+
     | make           |
     |----------------|
 53. | Audi 5000      |
 54. | Audi Fox       |
 55. | BMW 320i       |
 56. | Datsun 200     |
 57. | Datsun 210     |
     |----------------|
 58. | Datsun 510     |
 59. | Datsun 810     |
 60. | Fiat Strada    |
 61. | Honda Accord   |
 62. | Honda Civic    |
     |----------------|
 63. | Mazda GLC      |
 64. | Peugeot 604    |
 65. | Renault Le Car |
 66. | Subaru         |
 67. | Toyota Celica  |
     |----------------|
 68. | Toyota Corolla |
 69. | Toyota Corona  |
 70. | VW Dasher      |
 71. | VW Diesel      |
 72. | VW Rabbit      |
     |----------------|
 73. | VW Scirocco    |
 74. | Volvo 260      |
     +----------------+
list make if !foreign

     +-------------------+
     | make              |
     |-------------------|
  1. | AMC Concord       |
  2. | AMC Pacer         |
  3. | AMC Spirit        |
  4. | Buick Century     |
  5. | Buick Electra     |
     |-------------------|
  6. | Buick LeSabre     |
  7. | Buick Opel        |
  8. | Buick Regal       |
  9. | Buick Riviera     |
 10. | Buick Skylark     |
     |-------------------|
 11. | Cad. Deville      |
 12. | Cad. Eldorado     |
 13. | Cad. Seville      |
 14. | Chev. Chevette    |
 15. | Chev. Impala      |
     |-------------------|
 16. | Chev. Malibu      |
 17. | Chev. Monte Carlo |
 18. | Chev. Monza       |
 19. | Chev. Nova        |
 20. | Dodge Colt        |
     |-------------------|
 21. | Dodge Diplomat    |
 22. | Dodge Magnum      |
 23. | Dodge St. Regis   |
 24. | Ford Fiesta       |
 25. | Ford Mustang      |
     |-------------------|
 26. | Linc. Continental |
 27. | Linc. Mark V      |
 28. | Linc. Versailles  |
 29. | Merc. Bobcat      |
 30. | Merc. Cougar      |
     |-------------------|
 31. | Merc. Marquis     |
 32. | Merc. Monarch     |
 33. | Merc. XR-7        |
 34. | Merc. Zephyr      |
 35. | Olds 98           |
     |-------------------|
 36. | Olds Cutl Supr    |
 37. | Olds Cutlass      |
 38. | Olds Delta 88     |
 39. | Olds Omega        |
 40. | Olds Starfire     |
     |-------------------|
 41. | Olds Toronado     |
 42. | Plym. Arrow       |
 43. | Plym. Champ       |
 44. | Plym. Horizon     |
 45. | Plym. Sapporo     |
     |-------------------|
 46. | Plym. Volare      |
 47. | Pont. Catalina    |
 48. | Pont. Firebird    |
 49. | Pont. Grand Prix  |
 50. | Pont. Le Mans     |
     |-------------------|
 51. | Pont. Phoenix     |
 52. | Pont. Sunbird     |
     +-------------------+

Combining Conditions

You can combine conditions with & (logical and) or | (logical or). The character used for logical or is called the “pipe” character and you type it by pressing Shift-Backslash, the key right above Enter. Try:
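The command itself does not appear here; a command that fits the description that follows would look something like this sketch, which assumes the bundled auto data:

```stata
* Assumption: the data are the auto data set that ships with Stata.
sysuse auto, clear

* Cars that get more than 25 mpg OR cost less than 5000.
* Meeting either condition is enough to be listed.
list make mpg price if mpg>25 | price<5000
```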

This shows you cars that get more than 25 miles per gallon or cost less than $5000. A car only needs to meet one of the two conditions to be shown (meeting both is fine too). In set theory terms it is the union of the two sets.
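For contrast, a sketch of the same two conditions joined with & (again assuming the bundled auto data) lists only the cars that meet both, which is the intersection of the two sets:

```stata
* Assumption: the data are the auto data set that ships with Stata.
sysuse auto, clear

* Cars that get more than 25 mpg AND cost less than 5000.
* Both conditions must be true for a car to be listed.
list make mpg price if mpg>25 & price<5000
```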

All the conditions to be combined must be complete. If you wanted to list the cars that have a 1 or a 2 for rep78 you should not use:

list make rep78 if rep78==1 | 2

What this does and why is left to the reader, but it’s not what you want. Instead, you should use:

list make rep78 if rep78==1 | rep78==2

     +---------------------------+
     | make                rep78 |
     |---------------------------|
 12. | Cad. Eldorado           2 |
 17. | Chev. Monte Carlo       2 |
 18. | Chev. Monza             2 |
 21. | Dodge Diplomat          2 |
 22. | Dodge Magnum            2 |
     |---------------------------|
 23. | Dodge St. Regis         2 |
 40. | Olds Starfire           1 |
 46. | Plym. Volare            2 |
 48. | Pont. Firebird          1 |
 52. | Pont. Sunbird           2 |
     +---------------------------+

Options

Options change how a command works. They go after any variable list or if condition, following a comma. The comma means “everything after this is options” so you only type one comma no matter how many options you’re using.

list make foreign, nolabel

     +-----------------------------+
     | make                foreign |
     |-----------------------------|
  1. | AMC Concord               0 |
  2. | AMC Pacer                 0 |
  3. | AMC Spirit                0 |
  4. | Buick Century             0 |
  5. | Buick Electra             0 |
     |-----------------------------|
  6. | Buick LeSabre             0 |
  7. | Buick Opel                0 |
  8. | Buick Regal               0 |
  9. | Buick Riviera             0 |
 10. | Buick Skylark             0 |
     |-----------------------------|
 11. | Cad. Deville              0 |
 12. | Cad. Eldorado             0 |
 13. | Cad. Seville              0 |
 14. | Chev. Chevette            0 |
 15. | Chev. Impala              0 |
     |-----------------------------|
 16. | Chev. Malibu              0 |
 17. | Chev. Monte Carlo         0 |
 18. | Chev. Monza               0 |
 19. | Chev. Nova                0 |
 20. | Dodge Colt                0 |
     |-----------------------------|
 21. | Dodge Diplomat            0 |
 22. | Dodge Magnum              0 |
 23. | Dodge St. Regis           0 |
 24. | Ford Fiesta               0 |
 25. | Ford Mustang              0 |
     |-----------------------------|
 26. | Linc. Continental         0 |
 27. | Linc. Mark V              0 |
 28. | Linc. Versailles          0 |
 29. | Merc. Bobcat              0 |
 30. | Merc. Cougar              0 |
     |-----------------------------|
 31. | Merc. Marquis             0 |
 32. | Merc. Monarch             0 |
 33. | Merc. XR-7                0 |
 34. | Merc. Zephyr              0 |
 35. | Olds 98                   0 |
     |-----------------------------|
 36. | Olds Cutl Supr            0 |
 37. | Olds Cutlass              0 |
 38. | Olds Delta 88             0 |
 39. | Olds Omega                0 |
 40. | Olds Starfire             0 |
     |-----------------------------|
 41. | Olds Toronado             0 |
 42. | Plym. Arrow               0 |
 43. | Plym. Champ               0 |
 44. | Plym. Horizon             0 |
 45. | Plym. Sapporo             0 |
     |-----------------------------|
 46. | Plym. Volare              0 |
 47. | Pont. Catalina            0 |
 48. | Pont. Firebird            0 |
 49. | Pont. Grand Prix          0 |
 50. | Pont. Le Mans             0 |
     |-----------------------------|
 51. | Pont. Phoenix             0 |
 52. | Pont. Sunbird             0 |
 53. | Audi 5000                 1 |
 54. | Audi Fox                  1 |
 55. | BMW 320i                  1 |
     |-----------------------------|
 56. | Datsun 200                1 |
 57. | Datsun 210                1 |
 58. | Datsun 510                1 |
 59. | Datsun 810                1 |
 60. | Fiat Strada               1 |
     |-----------------------------|
 61. | Honda Accord              1 |
 62. | Honda Civic               1 |
 63. | Mazda GLC                 1 |
 64. | Peugeot 604               1 |
 65. | Renault Le Car            1 |
     |-----------------------------|
 66. | Subaru                    1 |
 67. | Toyota Celica             1 |
 68. | Toyota Corolla            1 |
 69. | Toyota Corona             1 |
 70. | VW Dasher                 1 |
     |-----------------------------|
 71. | VW Diesel                 1 |
 72. | VW Rabbit                 1 |
 73. | VW Scirocco               1 |
 74. | Volvo 260                 1 |
     +-----------------------------+

Options must always be one word. Here the words “no” and “label” are combined because otherwise Stata would think they were two different options.

Note that browse has very few options (nolabel is one of them). If you’ve been replacing list with browse in your code, stick with list for the rest of the chapter.

Many options require additional information, such as a number or a variable they apply to. This additional information goes in parentheses directly after the option name. The string() option tells the list command to truncate string variables after a given number of characters, with the number going in the parentheses:

list make, string(5)

     +---------+
     | make    |
     |---------|
  1. | AMC C.. |
  2. | AMC P.. |
  3. | AMC S.. |
  4. | Buick.. |
  5. | Buick.. |
     |---------|
  6. | Buick.. |
  7. | Buick.. |
  8. | Buick.. |
  9. | Buick.. |
 10. | Buick.. |
     |---------|
 11. | Cad. .. |
 12. | Cad. .. |
 13. | Cad. .. |
 14. | Chev... |
 15. | Chev... |
     |---------|
 16. | Chev... |
 17. | Chev... |
 18. | Chev... |
 19. | Chev... |
 20. | Dodge.. |
     |---------|
 21. | Dodge.. |
 22. | Dodge.. |
 23. | Dodge.. |
 24. | Ford .. |
 25. | Ford .. |
     |---------|
 26. | Linc... |
 27. | Linc... |
 28. | Linc... |
 29. | Merc... |
 30. | Merc... |
     |---------|
 31. | Merc... |
 32. | Merc... |
 33. | Merc... |
 34. | Merc... |
 35. | Olds 98 |
     |---------|
 36. | Olds .. |
 37. | Olds .. |
 38. | Olds .. |
 39. | Olds .. |
 40. | Olds .. |
     |---------|
 41. | Olds .. |
 42. | Plym... |
 43. | Plym... |
 44. | Plym... |
 45. | Plym... |
     |---------|
 46. | Plym... |
 47. | Pont... |
 48. | Pont... |
 49. | Pont... |
 50. | Pont... |
     |---------|
 51. | Pont... |
 52. | Pont... |
 53. | Audi .. |
 54. | Audi .. |
 55. | BMW 3.. |
     |---------|
 56. | Datsu.. |
 57. | Datsu.. |
 58. | Datsu.. |
 59. | Datsu.. |
 60. | Fiat .. |
     |---------|
 61. | Honda.. |
 62. | Honda.. |
 63. | Mazda.. |
 64. | Peuge.. |
 65. | Renau.. |
     |---------|
 66. | Subaru  |
 67. | Toyot.. |
 68. | Toyot.. |
 69. | Toyot.. |
 70. | VW Da.. |
     |---------|
 71. | VW Di.. |
 72. | VW Ra.. |
 73. | VW Sc.. |
 74. | Volvo.. |
     +---------+
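Since one comma introduces all the options, the two options used so far can be combined in a single command. A short sketch, assuming the bundled auto data and an arbitrary mpg cutoff, that also shows where the if condition sits relative to the comma:

```stata
* Assumption: the data are the auto data set that ships with Stata.
sysuse auto, clear

* Two options after a single comma.
list make foreign, nolabel string(5)

* Order of the pieces: variable list, if condition, comma, options.
list make foreign if mpg>25, nolabel string(5)
```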

By Groups

By groups allows you to execute a command separately for subgroups within your data. Try:

by foreign: list make
-> foreign = Domestic

     +-------------------+
     | make              |
     |-------------------|
  1. | AMC Concord       |
  2. | AMC Pacer         |
  3. | AMC Spirit        |
  4. | Buick Century     |
  5. | Buick Electra     |
     |-------------------|
  6. | Buick LeSabre     |
  7. | Buick Opel        |
  8. | Buick Regal       |
  9. | Buick Riviera     |
 10. | Buick Skylark     |
     |-------------------|
 11. | Cad. Deville      |
 12. | Cad. Eldorado     |
 13. | Cad. Seville      |
 14. | Chev. Chevette    |
 15. | Chev. Impala      |
     |-------------------|
 16. | Chev. Malibu      |
 17. | Chev. Monte Carlo |
 18. | Chev. Monza       |
 19. | Chev. Nova        |
 20. | Dodge Colt        |
     |-------------------|
 21. | Dodge Diplomat    |
 22. | Dodge Magnum      |
 23. | Dodge St. Regis   |
 24. | Ford Fiesta       |
 25. | Ford Mustang      |
     |-------------------|
 26. | Linc. Continental |
 27. | Linc. Mark V      |
 28. | Linc. Versailles  |
 29. | Merc. Bobcat      |
 30. | Merc. Cougar      |
     |-------------------|
 31. | Merc. Marquis     |
 32. | Merc. Monarch     |
 33. | Merc. XR-7        |
 34. | Merc. Zephyr      |
 35. | Olds 98           |
     |-------------------|
 36. | Olds Cutl Supr    |
 37. | Olds Cutlass      |
 38. | Olds Delta 88     |
 39. | Olds Omega        |
 40. | Olds Starfire     |
     |-------------------|
 41. | Olds Toronado     |
 42. | Plym. Arrow       |
 43. | Plym. Champ       |
 44. | Plym. Horizon     |
 45. | Plym. Sapporo     |
     |-------------------|
 46. | Plym. Volare      |
 47. | Pont. Catalina    |
 48. | Pont. Firebird    |
 49. | Pont. Grand Prix  |
 50. | Pont. Le Mans     |
     |-------------------|
 51. | Pont. Phoenix     |
 52. | Pont. Sunbird     |
     +-------------------+

-------------------------------------------------------------------------------
-> foreign = Foreign

     +----------------+
     | make           |
     |----------------|
  1. | Audi 5000      |
  2. | Audi Fox       |
  3. | BMW 320i       |
  4. | Datsun 200     |
  5. | Datsun 210     |
     |----------------|
  6. | Datsun 510     |
  7. | Datsun 810     |
  8. | Fiat Strada    |
  9. | Honda Accord   |
 10. | Honda Civic    |
     |----------------|
 11. | Mazda GLC      |
 12. | Peugeot 604    |
 13. | Renault Le Car |
 14. | Subaru         |
 15. | Toyota Celica  |
     |----------------|
 16. | Toyota Corolla |
 17. | Toyota Corona  |
 18. | VW Dasher      |
 19. | VW Diesel      |
 20. | VW Rabbit      |
     |----------------|
 21. | VW Scirocco    |
 22. | Volvo 260      |
     +----------------+

The by foreign: prefix tells Stata to:

1. Identify the unique values of foreign (in this case, 0 and 1 or “Domestic” and “Foreign”)
2. Temporarily split the data set into groups based on their value of foreign
3. Run the subsequent command (list make) separately for each group

You’ll see how powerful by is later.
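As a rough analogy only, the effect of by foreign: list make resembles running the command once per group with an if condition. The sketch below assumes the bundled auto data; unlike by, it keeps the original observation numbers rather than restarting them within each group:

```stata
* Assumption: the data are the auto data set that ships with Stata.
sysuse auto, clear

* One run of the command for each distinct value of foreign.
list make if foreign==0
list make if foreign==1
```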

In order for by to work, the data must be sorted by the same variable. You can do that with the sort command:

sort rep78
by rep78: list make
-> rep78 = 1

     +----------------+
     | make           |
     |----------------|
  1. | Pont. Firebird |
  2. | Olds Starfire  |
     +----------------+

-------------------------------------------------------------------------------
-> rep78 = 2

     +-------------------+
     | make              |
     |-------------------|
  1. | Plym. Volare      |
  2. | Pont. Sunbird     |
  3. | Dodge Diplomat    |
  4. | Cad. Eldorado     |
  5. | Dodge St. Regis   |
     |-------------------|
  6. | Chev. Monza       |
  7. | Dodge Magnum      |
  8. | Chev. Monte Carlo |
     +-------------------+

-------------------------------------------------------------------------------
-> rep78 = 3

     +-------------------+
     | make              |
     |-------------------|
  1. | Pont. Grand Prix  |
  2. | Olds Toronado     |
  3. | Olds Cutl Supr    |
  4. | Ford Mustang      |
  5. | Buick Regal       |
     |-------------------|
  6. | AMC Pacer         |
  7. | AMC Concord       |
  8. | Buick Century     |
  9. | Olds Cutlass      |
 10. | Linc. Continental |
     |-------------------|
 11. | Buick LeSabre     |
 12. | Buick Riviera     |
 13. | Pont. Le Mans     |
 14. | Linc. Versailles  |
 15. | Merc. Zephyr      |
     |-------------------|
 16. | Merc. Monarch     |
 17. | Cad. Deville      |
 18. | Merc. Marquis     |
 19. | Renault Le Car    |
 20. | Linc. Mark V      |
     |-------------------|
 21. | Chev. Malibu      |
 22. | Cad. Seville      |
 23. | Plym. Arrow       |
 24. | Fiat Strada       |
 25. | Buick Skylark     |
     |-------------------|
 26. | Chev. Chevette    |
 27. | Chev. Nova        |
 28. | Audi Fox          |
 29. | Olds Omega        |
 30. | Plym. Horizon     |
     +-------------------+

-------------------------------------------------------------------------------
-> rep78 = 4

     +----------------+
     | make           |
     |----------------|
  1. | Olds Delta 88  |
  2. | Datsun 200     |
  3. | VW Dasher      |
  4. | Honda Civic    |
  5. | Ford Fiesta    |
     |----------------|
  6. | Datsun 810     |
  7. | Buick Electra  |
  8. | Merc. Cougar   |
  9. | Merc. XR-7     |
 10. | VW Scirocco    |
     |----------------|
 11. | Olds 98        |
 12. | Pont. Catalina |
 13. | BMW 320i       |
 14. | VW Rabbit      |
 15. | Chev. Impala   |
     |----------------|
 16. | Mazda GLC      |
 17. | Datsun 510     |
 18. | Merc. Bobcat   |
     +----------------+

-------------------------------------------------------------------------------
-> rep78 = 5

     +----------------+
     | make           |
     |----------------|
  1. | Audi 5000      |
  2. | Subaru         |
  3. | Volvo 260      |
  4. | Dodge Colt     |
  5. | Toyota Corona  |
     |----------------|
  6. | Honda Accord   |
  7. | VW Diesel      |
  8. | Datsun 210     |
  9. | Plym. Champ    |
 10. | Toyota Corolla |
     |----------------|
 11. | Toyota Celica  |
     +----------------+

-------------------------------------------------------------------------------
-> rep78 = .

     +---------------+
     | make          |
     |---------------|
  1. | Buick Opel    |
  2. | Pont. Phoenix |
  3. | Plym. Sapporo |
  4. | AMC Spirit    |
  5. | Peugeot 604   |
     +---------------+
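The sort and by steps can also be written as one command with bysort. A minimal sketch, assuming the bundled auto data:

```stata
* Assumption: the data are the auto data set that ships with Stata.
sysuse auto, clear

* bysort sorts the data by rep78 and then runs the command for each group,
* so no separate sort command is needed first.
bysort rep78: list make
```

Note that the data stay sorted by rep78 afterwards, so a later by foreign: would need its own sort (or bysort) first.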
Summary Statistics for Continuous Variables

summarize (or just sum) gives you summary statistics which will help you understand the distribution of continuous (quantitative) variables. Start by adding sum all by itself to your do file and running it by pressing Ctrl-d or clicking the “play” button in the top right of your Stata window, then take a look at the output:
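In a do file this step could look like the following sketch, which assumes the bundled auto data (the sysuse line is an assumption about how the data were loaded):

```stata
* Assumption: the data are the auto data set that ships with Stata.
sysuse auto, clear

* With no variable list, sum reports on every variable in the data set.
sum
```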

This gives basic summary statistics for all the variables in your data set. Note that there is nothing for make: it is a string variable so summary statistics don’t make sense. Also note that for rep78 the number of observations is 69 rather than 74. That’s because five missing values were ignored and the summary statistics were calculated over the remaining 69 values of rep78. Most statistical commands take a similar approach to missing values and that’s usually what you want, so you rarely have to include special handling for missing values in statistical commands.
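One way to check the missing-value counts directly is with count and the missing() function. This is a sketch assuming the bundled auto data; according to the text above, the two commands should report 5 and 69:

```stata
* Assumption: the data are the auto data set that ships with Stata.
sysuse auto, clear

* Observations where rep78 is missing.
count if missing(rep78)

* Observations where rep78 is not missing.
count if !missing(rep78)
```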

All the syntax elements you learned earlier also work with statistical commands. To get summary statistics for just mpg, give sum a variable list:
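A minimal sketch of that command, assuming the bundled auto data:

```stata
* Assumption: the data are the auto data set that ships with Stata.
sysuse auto, clear

* Summary statistics for mpg only.
sum mpg
```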

If you want summary statistics for just the foreign cars, add an if condition:
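One way to write the condition, assuming the auto data set, where foreign is 1 for foreign cars and 0 for domestic cars (the loading guard is an addition for this example):

```stata
* Load the auto data only if it is not already in memory
capture confirm variable mpg
if _rc sysuse auto, clear

* Summary statistics for mpg, restricted to foreign cars
* (a bare "if foreign" is true wherever foreign is not zero)
summarize mpg if foreign
```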

If you want summary statistics of mpg for both foreign and domestic cars, calculated separately, use by:

by foreign: sum mpg
-> foreign = Domestic

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         mpg |         52    19.82692    4.743297         12         34

-------------------------------------------------------------------------------
-> foreign = Foreign

    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         mpg |         22    24.77273    6.611187         14         41

The detail (d) option will give more information. Try:

sum mpg, detail
                        Mileage (mpg)
-------------------------------------------------------------
      Percentiles      Smallest
 1%           12             12
 5%           14             12
10%           14             14       Obs                  74
25%           18             14       Sum of wgt.          74

50%           20                      Mean            21.2973
                        Largest       Std. dev.      5.785503
75%           25             34
90%           29             35       Variance       33.47205
95%           34             35       Skewness       .9487176
99%           41             41       Kurtosis       3.975005

Frequencies for Categorical Variables

tabulate (tab) will create tables of frequencies, which will help you understand the distribution of categorical variables. It can also be useful for string variables that describe categories or groups.

If you give tab a variable list with one variable it will give you a one-way table, while if you give it two variables it will give you a two-way table (i.e. crosstabs). To get an idea of what tab does, add the following to your do file and run it:

tab rep78 foreign
    Repair |
    record |      Car origin
      1978 |  Domestic    Foreign |     Total
-----------+----------------------+----------
         1 |         2          0 |         2 
         2 |         8          0 |         8 
         3 |        27          3 |        30 
         4 |         9          9 |        18 
         5 |         2          9 |        11 
-----------+----------------------+----------
     Total |        48         21 |        69 

The tab command has a rich set of useful options. The missing values of rep78 were not included in the table, which makes it easy to forget they’re there. Add them with the missing option:

tab rep78 , missing
     Repair |
record 1978 |      Freq.     Percent        Cum.
------------+-----------------------------------
          1 |          2        2.70        2.70
          2 |          8       10.81       13.51
          3 |         30       40.54       54.05
          4 |         18       24.32       78.38
          5 |         11       14.86       93.24
          . |          5        6.76      100.00
------------+-----------------------------------
      Total |         74      100.00

To get percentages in a two-way table add the row, column, or cell options:

tab rep78 foreign, row column cell
+-------------------+
| Key               |
|-------------------|
|     frequency     |
|  row percentage   |
| column percentage |
|  cell percentage  |
+-------------------+

    Repair |
    record |      Car origin
      1978 |  Domestic    Foreign |     Total
-----------+----------------------+----------
         1 |         2          0 |         2 
           |    100.00       0.00 |    100.00 
           |      4.17       0.00 |      2.90 
           |      2.90       0.00 |      2.90 
-----------+----------------------+----------
         2 |         8          0 |         8 
           |    100.00       0.00 |    100.00 
           |     16.67       0.00 |     11.59 
           |     11.59       0.00 |     11.59 
-----------+----------------------+----------
         3 |        27          3 |        30 
           |     90.00      10.00 |    100.00 
           |     56.25      14.29 |     43.48 
           |     39.13       4.35 |     43.48 
-----------+----------------------+----------
         4 |         9          9 |        18 
           |     50.00      50.00 |    100.00 
           |     18.75      42.86 |     26.09 
           |     13.04      13.04 |     26.09 
-----------+----------------------+----------
         5 |         2          9 |        11 
           |     18.18      81.82 |    100.00 
           |      4.17      42.86 |     15.94 
           |      2.90      13.04 |     15.94 
-----------+----------------------+----------
     Total |        48         21 |        69 
           |     69.57      30.43 |    100.00 
           |    100.00     100.00 |    100.00 
           |     69.57      30.43 |    100.00 

tab has an option called sum which gives summary statistics for a given variable, calculated over the observations in each cell of the table. Try:
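A minimal sketch, assuming foreign defines the cells of the table and mpg is the variable to summarize (both choices are illustrative, and the loading guard is an addition for this example):

```stata
* Load the auto data only if it is not already in memory
capture confirm variable mpg
if _rc sysuse auto, clear

* Mean, standard deviation, and frequency of mpg within each category of foreign
tabulate foreign, summarize(mpg)
```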

Chi squared test

There’s also a chi2 option that runs a chi-squared test on a two-way table:
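As a sketch, assuming the same two-way table of rep78 by foreign used earlier (the loading guard is an addition for this example):

```stata
* Load the auto data only if it is not already in memory
capture confirm variable rep78
if _rc sysuse auto, clear

* Two-way table with Pearson's chi-squared test of independence
tabulate rep78 foreign, chi2
```

Several cells in this table are very small, so the chi-squared approximation may be unreliable here; tabulate also accepts an exact option that requests Fisher's exact test.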

Generate and Replace

The primary commands for creating and changing variables are generate (usually abbreviated gen) and replace (which, like other commands that can destroy information, has no abbreviation). gen creates new variables; replace changes the values of existing variables. Their core syntax is identical:

gen variable = expression or

replace variable = expression

where variable is the name of the variable you want to create or change, and expression is the mathematical expression whose result you want to put in it. Expressions can be as simple as a single number or involve all sorts of complicated functions. You can explore what functions are available by typing help functions. If the expression depends on a missing value at any point, the result is missing. Usually this is exactly what you’d expect and want.

The prices in the auto data set are in 1978 dollars, so it might be useful to convert them to January 2024 dollars. To do so you need to multiply the prices by a conversion factor which is the Consumer Price Index in January 2024 divided by the Consumer Price Index in 1978, or about 5. The code will be:

Add this line to your do file, run it, and examine the results with:
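A minimal sketch of both steps, assuming the rounded conversion factor of 5 from the text; the variable name price2024 is an illustrative choice, and the loading guard is an addition for this example:

```stata
* Load the auto data only if it is not already in memory
capture confirm variable price
if _rc sysuse auto, clear

* Convert 1978 prices to approximate January 2024 dollars
* (5 is the rounded conversion factor; price2024 is an illustrative name)
generate price2024 = price * 5

* Compare the original and converted prices for the first ten cars
list make price price2024 in 1/10
```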

Creating Variables with If Conditions

If a gen command has an if condition, the resulting variable will (and must) still exist for all observations. However it will be assigned a missing value for observations where the if condition is not true. If a replace command has an if condition, observations where the if condition is not true will be left unchanged. This allows you to set variables to different values for different groups of observations.

Suppose you wanted to collapse the five-point scale of the rep78 variable into a three-point scale. The first step is to lay out exactly how you want to do that in your native language, because if it’s not clear to you you’ll never be able to explain it to Stata. We’ll declare that cars with a rep78 of one or two will get a one for the new variable rep3, cars with a three for rep78 will get a two, and cars with a four or five will get a three.

You can implement that with:

gen rep3 = 1 if rep78<3
replace rep3 = 2 if rep78==3
replace rep3 = 3 if rep78>3 & rep78<.
(64 missing values generated)

(30 real changes made)

(29 real changes made)

Recode

The recode command gives you an alternative way of creating rep3. It is designed solely for recoding tasks and is much less flexible than gen and replace. But it’s very easy to use. The syntax is:

recode var (rule 1) (rule 2) (more rules as needed...), gen(newvar)

The gen option at the end is not required—if it’s not there then the original variable will be changed rather than Stata creating a new variable containing the new values. You can also have recode work on a list of variables, recoding them all in the same way.

The core of the recode command is a list of rules, each in parentheses, that tell it how a variable is to be recoded. They take the form (input_value = output_value). The input_value can be a single number, a list of numbers separated by spaces, or a range of numbers specified with start/end. The output_value will always be a single number. Anything not covered by a rule is left unchanged, so you can use recode to change just a few values of a variable or completely redefine it as we do here.
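The rules for rep3 described earlier can be sketched with recode as follows; the new variable name rep3b is an assumption chosen so it does not clash with rep3, and the loading guard is an addition for this example:

```stata
* Load the auto data only if it is not already in memory
capture confirm variable rep78
if _rc sysuse auto, clear

* Collapse the five-point rep78 scale into three categories:
*   1 or 2 -> 1,  3 -> 2,  4 or 5 -> 3
* Missing values match no rule, so they stay missing
recode rep78 (1 2 = 1) (3 = 2) (4 5 = 3), gen(rep3b)

* Cross-check the new variable against the original
tabulate rep78 rep3b, missing
```

The last rule could equally be written with a range, as (4/5 = 3).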

Indicator Variables

In creating indicator variables you can take advantage of the fact that Stata treats true as one and false as zero by setting the new variable equal to a condition. Consider:

(The parentheses are optional, but make the command easier to read.) This creates an indicator variable called low_mpg which is one (true) for cars where mpg is less than twenty and zero (false) where mpg is greater than or equal to twenty. To see the results run:
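A minimal sketch of the indicator variable and a quick check of it, assuming the auto data set (the loading guard and the in 1/10 restriction are additions for this example):

```stata
* Load the auto data only if it is not already in memory
capture confirm variable mpg
if _rc sysuse auto, clear

* The condition evaluates to 1 where mpg is below 20 and 0 otherwise
generate low_mpg = (mpg < 20)

* Inspect the result for the first ten cars
list make mpg low_mpg in 1/10
```

If mpg contained missing values they would be coded 0, because Stata treats missing as larger than any number; adding if mpg < . to the command would leave them missing instead.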

String Variables

You can create and change string variables with gen and replace just like numeric variables. One difference is that string values go in quotes; another is that for a string variable missing is "", i.e. a string that contains nothing. For example:
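One possible illustration, assuming the auto data set; the variable name origin_text and the text values are hypothetical choices made for this sketch:

```stata
* Load the auto data only if it is not already in memory
capture confirm variable foreign
if _rc sysuse auto, clear

* String values go in quotes; cars that fail the condition get "" (missing)
generate origin_text = "Domestic" if foreign == 0

* Fill in the remaining empty values
replace origin_text = "Foreign" if origin_text == ""

tabulate origin_text
```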

You can’t do math with strings, but there are a variety of useful functions for working with them. One of them is strpos(). Given two strings (string values or string variables), it will return the position of the second string within the first string, or a zero if the first string does not contain the second string. This makes it very useful in if conditions. For example, you can select all Volkswagen cars with:
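A minimal sketch, assuming the auto data set, where Volkswagen models are recorded with the prefix VW (the loading guard is an addition for this example):

```stata
* Load the auto data only if it is not already in memory
capture confirm variable make
if _rc sysuse auto, clear

* strpos() returns a nonzero position (true) when make contains "VW"
list make if strpos(make, "VW")
```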

Remember, in an if condition zero is false and anything else is true, so this is equivalent to:

list make if strpos(make, "VW")!=0
     +-------------+
     | make        |
     |-------------|
 70. | VW Dasher   |
 71. | VW Diesel   |
 72. | VW Rabbit   |
 73. | VW Scirocco |
     +-------------+

Another useful function is word(). Given a string and a number n, word will return the nth word in the string. For example, you can make a variable containing the manufacturer of each car with:

gen manufacturer = word(make, 1)
list make manufacturer if strpos(make, "VW")
     +------------------------+
     | make          manufa~r |
     |------------------------|
 70. | VW Dasher           VW |
 71. | VW Diesel           VW |
 72. | VW Rabbit           VW |
 73. | VW Scirocco         VW |
     +------------------------+

Labels

Good labels make your data much easier to understand and work with. While Stata has many kinds of labels, we’ll focus on the most common and most useful: variable labels and value labels.

Variable Labels

Variable labels convey information about a variable, and can be a substitute for long variable names. This data set already has a good set of variable labels, as you can see in the Variables window, but let’s make the label on price more specific. The syntax to set a variable label is:

label variable variable_name "label"

So type:

label variable price "Price in 1978 Dollars"

You can use the describe command to get information about a variable, including its variable label:

describe price
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
price           int     %8.0gc                Price in 1978 Dollars

Value Labels

Value labels are used with categorical variables to tell you what the categories mean. We’ve seen one in action with the foreign variable: it was the value labels that told us that a zero means “Domestic” and a one means “Foreign.”

Let’s explore value labels by labeling the values of rep3, the new variable we recoded to collapse rep78 from a five point scale to a three point scale. Value labels are a mapping from a set of integers to a set of text descriptions, so the first step is to define the map. To do so, use the label define command:

label define map_name value1 "label1" value2 "label2"...

Thus:

label define rep_label 1 "Bad" 2 "Average" 3 "Good"

This creates a mapping called rep_label but does not apply it to anything. Before it does anything useful you have to tell Stata to label the values of the rep3 variable using the rep_label mapping you just defined. The syntax is:

label values variable map

And thus:

label values rep3 rep_label

To see the results, run:

tab rep3
       rep3 |      Freq.     Percent        Cum.
------------+-----------------------------------
        Bad |         10       14.49       14.49
    Average |         30       43.48       57.97
       Good |         29       42.03      100.00
------------+-----------------------------------
      Total |         69      100.00

Once a map is defined you can apply it to any number of variables: just replace the single variable in the label values command above with a list of variables. Suppose you’re working with survey data and your variables include the gender of the respondent, the gender of the respondent’s spouse, and the genders of all the respondent’s children. You could define just one map called gender and then use it to label the values of all the gender variables.

Three commands are useful for managing value labels: label dir lists all the defined value labels, label list shows what each one means, and describe, which we used earlier to see the variable label, also shows the name of the value label associated with a variable (and other useful things).
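These commands can be tried together as follows. This is a sketch that assumes the auto data set and rebuilds rep3 and rep_label only if they do not already exist, so it can be run on its own (the guards are additions for this example):

```stata
* Load the auto data only if it is not already in memory
capture confirm variable rep78
if _rc sysuse auto, clear

* Recreate rep3 only if it does not exist yet
capture confirm variable rep3
if _rc {
    generate rep3 = 1 if rep78 < 3
    replace  rep3 = 2 if rep78 == 3
    replace  rep3 = 3 if rep78 > 3 & rep78 < .
}

* capture skips the error if rep_label is already defined
capture label define rep_label 1 "Bad" 2 "Average" 3 "Good"
label values rep3 rep_label

label dir               // names of all value labels in memory
label list rep_label    // the value-to-text mapping for one label
describe rep3           // shows rep_label in the "Value label" column
```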

Labels via Recode

When you use recode to create a new variable, Stata will automatically create a variable label for it (“RECODE of …”). You can also define value labels for it by putting the desired label for each value at the end of the rule that defines it. Create yet another version of rep3, this time with labels right from its creation, with:

recode rep78 (1 2 = 1 "Bad") (3 = 2 "Average") (4 5 = 3 "Good"), gen(rep3c)
(67 differences between rep78 and rep3c)
Importing Comma delimited files (CSV)

We can use import delimited to import the data from CS1policies.csv

  • a simple dataset that I use for teaching CS1 exams
import delimited "CS1policies.csv", clear
(encoding automatically selected: ISO-8859-1)
(4 vars, 1,000 obs)
summarize ,detail
                             v1
-------------------------------------------------------------
      Percentiles      Smallest
 1%         10.5              1
 5%         50.5              2
10%        100.5              3       Obs               1,000
25%        250.5              4       Sum of wgt.       1,000

50%        500.5                      Mean              500.5
                        Largest       Std. dev.      288.8194
75%        750.5            997
90%        900.5            998       Variance       83416.67
95%        950.5            999       Skewness              0
99%        990.5           1000       Kurtosis       1.799998

                             age
-------------------------------------------------------------
      Percentiles      Smallest
 1%           30             30
 5%           31             30
10%           33             30       Obs               1,000
25%           37             30       Sum of wgt.       1,000

50%         44.5                      Mean             44.602
                        Largest       Std. dev.      8.459501
75%           52             60
90%           56             60       Variance       71.56316
95%           58             60       Skewness       .0442793
99%           60             60       Kurtosis       1.844854

                          duration
-------------------------------------------------------------
      Percentiles      Smallest
 1%         12.8             12
 5%        16.95           12.1
10%         23.4           12.1       Obs               1,000
25%         39.9           12.1       Sum of wgt.       1,000

50%         67.1                      Mean            66.7856
                        Largest       Std. dev.      31.45082
75%        93.95          119.8
90%        111.5          119.9       Variance        989.154
95%       115.45          119.9       Skewness      -.0036626
99%       118.95          119.9       Kurtosis       1.813458

                           claimed
-------------------------------------------------------------
      Percentiles      Smallest
 1%            0              0
 5%            0              0
10%            0              0       Obs               1,000
25%            0              0       Sum of wgt.       1,000

50%            0                      Mean               .222
                        Largest       Std. dev.      .4157991
75%            0              1
90%            1              1       Variance       .1728889
95%            1              1       Skewness       1.337853
99%            1              1       Kurtosis       2.789852

We can type describe to view the contents of the data in memory.

describe
Contains data
 Observations:         1,000                  
    Variables:             4                  
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
v1              int     %8.0g                 
age             byte    %8.0g                 
duration        float   %9.0g                 
claimed         byte    %8.0g                 
-------------------------------------------------------------------------------
Sorted by: 
     Note: Dataset has changed since last saved.

Codebook

Codebook gives detailed information for certain variables.

codebook age
age                                                                 (unlabeled)
-------------------------------------------------------------------------------

                  Type: Numeric (byte)

                 Range: [30,60]                       Units: 1
         Unique values: 31                        Missing .: 0/1,000

                  Mean: 44.602
             Std. dev.: 8.4595

           Percentiles:    10%       25%       50%       75%       90%
                            33        37      44.5        52        56

Conditional codebook

codebook age if claimed  == 1
age                                                                 (unlabeled)
-------------------------------------------------------------------------------

                  Type: Numeric (byte)

                 Range: [30,60]                       Units: 1
         Unique values: 31                        Missing .: 0/222

                  Mean: 45.9324
             Std. dev.: 8.55908

           Percentiles:     10%       25%       50%       75%       90%
                             33        39        46        54        57

Compact codebook

  • this command provides compact summary statistics for the variables in the dataset
codebook , compact
Variable    Obs Unique     Mean  Min    Max  Label
-------------------------------------------------------------------------------
v1         1000   1000    500.5    1   1000  
age        1000     31   44.602   30     60  
duration   1000    641  66.7856   12  119.9  
claimed    1000      2     .222    0      1  
-------------------------------------------------------------------------------

Tables and frequency summaries

Next we can tabulate the variable claimed

tabulate claimed
    claimed |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |        778       77.80       77.80
          1 |        222       22.20      100.00
------------+-----------------------------------
      Total |      1,000      100.00

Or

table claimed
        |  Frequency
--------+-----------
claimed |           
  0     |        778
  1     |        222
  Total |      1,000
--------------------

Or

tabstat claimed
    Variable |      Mean
-------------+----------
     claimed |      .222
------------------------
  • tab1 produces a one-way table for each of several variables at once
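A short sketch of tab1, assuming CS1policies.csv is in the working directory; the choice of claimed and age is illustrative, and the loading guard is an addition for this example:

```stata
* Import the policy data only if it is not already in memory
capture confirm variable claimed
if _rc import delimited "CS1policies.csv", clear

* One separate one-way frequency table per listed variable
tab1 claimed age
```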

To create a table containing the mean values of several numerical variables for each level of a categorical variable, we use table:

table claimed ,statistic(mean duration age)
        |  duration        age
--------+---------------------
claimed |                     
  0     |  72.20386   44.22237
  1     |   47.7973   45.93243
  Total |   66.7856     44.602
------------------------------
summarize age duration
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
         age |      1,000      44.602    8.459501         30         60
    duration |      1,000     66.7856    31.45082         12      119.9
/*Lists the variables in the dataset */
ds
v1        age       duration  claimed
/* First 10 observations */
list in 1/10 
     +-------------------------------+
     | v1   age   duration   claimed |
     |-------------------------------|
  1. |  1    55       24.3         1 |
  2. |  2    45       79.2         0 |
  3. |  3    33       77.8         0 |
  4. |  4    41       79.3         0 |
  5. |  5    53        105         0 |
     |-------------------------------|
  6. |  6    42       81.2         0 |
  7. |  7    44         13         1 |
  8. |  8    33       37.1         0 |
  9. |  9    59       83.9         0 |
 10. | 10    60       67.5         0 |
     +-------------------------------+
/* Last 10 observations: negative numbers count back from the end, */
/*  and l stands for the last observation                          */
list in -10/l
/* Show first 10 observations of the */
/*  variables age through duration   */
list age-duration in 1/10
     +----------------+
     | age   duration |
     |----------------|
  1. |  55       24.3 |
  2. |  45       79.2 |
  3. |  33       77.8 |
  4. |  41       79.3 |
  5. |  53        105 |
     |----------------|
  6. |  42       81.2 |
  7. |  44         13 |
  8. |  33       37.1 |
  9. |  59       83.9 |
 10. |  60       67.5 |
     +----------------+
/*View data */
browse 
request ignored because of batch mode

Generating new variables

gen logAge = log(age)
su
    Variable |        Obs        Mean    Std. dev.       Min        Max
-------------+---------------------------------------------------------
          v1 |      1,000       500.5    288.8194          1       1000
         age |      1,000      44.602    8.459501         30         60
    duration |      1,000     66.7856    31.45082         12      119.9
     claimed |      1,000        .222    .4157991          0          1
      logAge |      1,000    3.779292    .1939759   3.401197   4.094345

Missing data

misstable summ
(variables nonmissing or string)
misstable patterns
(no missing values)
misstable tree
(no missing values)
misstable nested
(no missing values)
Graphics

Distribution of One Variable

We’ll start with data visualizations that show the distribution of a variable.

Continuous Variables

A histogram will tell you more about the distribution of a continuous variable than summary statistics, and they’re easy to make. Just run:

hist age
(bin=29, start=30, width=1.0344828)

[Figure: histogram of age; y-axis: Density, x-axis: age]

By default Stata scales a histogram as a density, so that it approximates a probability density function and the areas of the bars sum to one. To get the kind of histogram you learned about in elementary school, where the height of each bar is proportional to the number of observations in the bin, add the freq option:

hist age, freq
(bin=29, start=30, width=1.0344828)

[Figure: histogram of age; y-axis: Frequency, x-axis: age]

histogram age , frequency normal
(bin=29, start=30, width=1.0344828)

[Figure: histogram of age with a normal density curve overlaid; y-axis: Frequency, x-axis: age]

hist claimed , discrete percent
(start=0, width=1)

[Figure: histogram of claimed; y-axis: Percent, x-axis: claimed]

Universal Options

  • title("") - gives the graph a meaningful title
  • subtitle("") - adds a subtitle below the main title
  • ytitle("") - sets the title of the y-axis
  • note("") - adds a footnote to the graph
histogram age , frequency normal title("Histogram of Age") note("author : Bongani Ncube") ytitle("Age of ....")
(bin=29, start=30, width=1.0344828)

[Figure: histogram titled "Histogram of Age"; y-axis title: Age of ...., x-axis: age; note: author : Bongani Ncube]

Categorical Variables

With categorical variables you’re interested in the frequencies. A bar graph won’t show you any more information than you’ll get by using tab to make a frequency table, but you can get a basic understanding of it in a single glance.

With the graph bar command you use the over() option to tell it the variable that defines the bars:

sysuse auto, clear
(1978 automobile data)
graph bar, over(rep78)

If you have labels for the bars (and you should) there’s a good chance the labels will overlap. This problem goes away immediately if you use horizontal bars, made with graph hbar:

graph hbar, over(rep78)

[Figure: horizontal bar chart of percent by rep78 (categories 1 to 5)]

By default, graph hbar (or graph bar) calculates the percentage of observations in each category. You can change that to frequencies by telling it you want to graph the (count):

graph hbar (count), over(rep78)

[Figure: horizontal bar chart of frequency by rep78 (categories 1 to 5)]

You can label the bars with those counts using the blabel(bar) option:

graph hbar (count), over(rep78) blabel(bar)

[Figure: horizontal bar chart of frequency by rep78, with bar labels 5 = 11, 4 = 18, 3 = 30, 2 = 8, 1 = 2]

Now it really contains all the same information as a frequency table.

Bar graphs can get tricky. For more on how to create them and make them look presentable, see Bar Graphs in Stata.

Relationships Between Variables

Data visualizations are also a great tool for understanding the relationships between two or more variables.

One Continuous Variable and One Categorical Variable

If you have a categorical variable and a continuous variable, one measure of the relationship between them is how the mean of the continuous variable varies across categories. graph hbar can do that with (mean) and then the name of the continuous variable:

graph hbar (mean) mpg, over(foreign)

[Figure: horizontal bar chart of mean of mpg for Foreign and Domestic cars]

This can tell you there’s a difference between the categories, but doesn’t tell you much about how the distribution of the continuous variable varies between them. A box plot will tell you more:

graph box mpg, over(foreign)

[Figure: box plots of Mileage (mpg) for Domestic and Foreign cars]

Working from the center out: the line in the middle of the box is the median, and the top and bottom of the box are the 75th and 25th percentiles respectively. The “whiskers” outside the box go to the upper adjacent and lower adjacent values. (To find the upper/lower adjacent value, take the 75th/25th percentile, add/subtract 1.5 times the difference between the 75th and 25th percentile, and find the largest/smallest observed value below/above that number.) Observations outside the whiskers get their own dot.
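The adjacent-value rule can be worked through by hand. The sketch below does so for domestic cars in the auto data set; the scalar names are hypothetical, and it assumes that summarize, detail and graph box use the same percentile definition:

```stata
* Load the auto data only if it is not already in memory
capture confirm variable mpg
if _rc sysuse auto, clear

* Quartiles of mpg for domestic cars (foreign == 0)
quietly summarize mpg if foreign == 0, detail
scalar q1  = r(p25)
scalar q3  = r(p75)
scalar iqr = scalar(q3) - scalar(q1)

* Upper adjacent value: largest observation at or below q3 + 1.5*IQR
quietly summarize mpg if foreign == 0 & mpg <= scalar(q3) + 1.5*scalar(iqr)
scalar upper_adj = r(max)

* Lower adjacent value: smallest observation at or above q1 - 1.5*IQR
quietly summarize mpg if foreign == 0 & mpg >= scalar(q1) - 1.5*scalar(iqr)
scalar lower_adj = r(min)

display "Whiskers for domestic cars run from " scalar(lower_adj) " to " scalar(upper_adj)
```

Any domestic car with mpg outside that interval would be drawn as its own dot.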

Violin plot

ssc install violinplot

Installing a package from SSC requires an internet connection. When this report was compiled the download site could not be reached (host not found, r(631)), so violinplot was not installed and no violin plot is shown.

A simple alternative is to create a histogram for each category. You can do this by adding the by() option to the hist command. This is conceptually similar to the by prefix:

hist mpg, by(foreign)

[Figure: histograms of Mileage (mpg) by Car origin (Domestic, Foreign); y-axis: Density]

Two Categorical Variables

Plotting relationships between two categorical variables can be fun, but it gets complicated. See Bar Graphs in Stata.

Two Continuous Variables

The classic plot for exploring the relationship between two continuous variables is a scatterplot, easily created with scatter:

scatter mpg weight

[Figure: scatterplot of Mileage (mpg) against Weight (lbs.)]

Three Variables

You can add a third variable to the mix by having it determine the color of the points in a scatterplot. As of Stata 18, this can be done with the colorvar option. By default, Stata will treat the color variable as continuous:

scatter mpg weight, colorvar(displacement)

[Figure: scatterplot of Mileage (mpg) against Weight (lbs.), with point color showing Displacement (cu. in.)]

displacement is a measure of the size of a car’s engine. This plot shows us that there is a strong relationship between a car’s gas mileage, its weight, and the size of its engine.

It’s more common to use color to represent a categorical variable. You can tell Stata that the color variable is categorical with the colordiscrete option, but unfortunately you also have to tell it to use a sensible legend for categorical variables with coloruseplegend (don’t ask why it has that name) and to use the value labels in the legend with zlabel(, val).

scatter mpg weight, colorvar(foreign) colordiscrete coloruseplegend zlabel(, val)

[Figure: scatterplot of Mileage (mpg) against Weight (lbs.), with point color showing Foreign or Domestic]

This shows that foreign cars were generally lighter than domestic cars in 1978, but they frequently had lower gas mileage than domestic cars of comparable weight.

Combining Plots

A scatterplot is an example of what Stata calls a twoway plot: a plot with a y and x axis. You can combine twoway plots by putting || (two pipe characters) between them. Each additional plot will go on top of the previous plots, like a coat of paint.

An lfit plot plots a linear fit (i.e. a univariate regression) of two variables. (There’s also qfit for quadratic fit, i.e. with a squared term.) Layer it over the scatterplot with:

scatter mpg weight if !foreign || lfit mpg weight

[Figure: scatterplot of Mileage (mpg) against Weight (lbs.) with a line of Fitted values]
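The quadratic fit mentioned above is layered in the same way. A minimal sketch, assuming the auto data set and using all cars rather than a subset (the loading guard is an addition for this example):

```stata
* Load the auto data only if it is not already in memory
capture confirm variable mpg
if _rc sysuse auto, clear

* Scatterplot with a quadratic fit (mpg on weight and weight squared) drawn on top
scatter mpg weight || qfit mpg weight
```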

lfit doesn’t do colorvar(), but you can layer two different lfit plots, one for each subset of the data. This is a good time to use /// to continue a command on the next line:

scatter mpg weight, colorvar(foreign) colordiscrete coloruseplegend zlabel(, val) ///
    || lfit mpg weight if foreign ///
    || lfit mpg weight if !foreign

[Figure: scatterplot of Mileage (mpg) against Weight (lbs.), colored by Foreign or Domestic, with two lines of Fitted values]

You’d obviously need to change the labels in the legend before you’d show this to anyone else, and that’s not something we’ll go into. But it’s perfectly adequate to help you understand the relationships between mpg, weight, and foreign.

Day 3: MSc in Epidemiology

Research Protocol Development 1

Descriptive statistics

use "bus.dta" , clear

Describing the data

codebook
job                                                                 (unlabeled)
-------------------------------------------------------------------------------

                  Type: Numeric (byte)

                 Range: [1,2]                         Units: 1
         Unique values: 2                         Missing .: 0/125

            Tabulation: Freq.  Value
                           59  1
                           66  2

-------------------------------------------------------------------------------
age                                                                 (unlabeled)
-------------------------------------------------------------------------------

                  Type: Numeric (byte)

                 Range: [21,59]                       Units: 1
         Unique values: 36                        Missing .: 0/125

                  Mean:   38.12
             Std. dev.: 10.1247

           Percentiles:     10%       25%       50%       75%       90%
                             27        30        37        47        53

-------------------------------------------------------------------------------
ht                                                                  (unlabeled)
-------------------------------------------------------------------------------

                  Type: Numeric (float)

                 Range: [1.52,1.91]                   Units: .01
         Unique values: 14                        Missing .: 0/125

                  Mean: 1.64216
             Std. dev.:  .06203

           Percentiles:     10%       25%       50%       75%       90%
                           1.57       1.6      1.63      1.68       1.7

-------------------------------------------------------------------------------
wt                                                                  (unlabeled)
-------------------------------------------------------------------------------

                  Type: Numeric (float)

                 Range: [45.9,97.3]                   Units: .1
         Unique values: 66                        Missing .: 0/125

                  Mean:  65.58
             Std. dev.: 10.009

           Percentiles:    10%       25%       50%       75%       90%
                          52.3      57.3      65.9      72.7      78.2

-------------------------------------------------------------------------------
triglyc                                                             (unlabeled)
-------------------------------------------------------------------------------

                  Type: Numeric (int)

                 Range: [63,484]                      Units: 1
         Unique values: 99                        Missing .: 0/125

                  Mean: 197.648
             Std. dev.: 84.7462

           Percentiles:     10%       25%       50%       75%       90%
                             98       123       195       255       316

-------------------------------------------------------------------------------
sbp                                                                 (unlabeled)
-------------------------------------------------------------------------------

                  Type: Numeric (int)

                 Range: [100,200]                     Units: 1
         Unique values: 15                        Missing .: 0/125

                  Mean: 128.424
             Std. dev.: 18.4929

           Percentiles:     10%       25%       50%       75%       90%
                            110       120       120       136       150
browse
request ignored because of batch mode
describe
Contains data from bus.dta
 Observations:           125                  
    Variables:             6                  26 Jan 2019 12:06
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
job             byte    %9.0g                 
age             byte    %9.0g                 
ht              float   %9.0g                 
wt              float   %9.0g                 
triglyc         int     %9.0g                 
sbp             int     %9.0g                 
-------------------------------------------------------------------------------
Sorted by: 

Label define

label define joblabel 1 "Driver" 2 "Conductor"

label values job joblabel
tab job
        job |      Freq.     Percent        Cum.
------------+-----------------------------------
     Driver |         59       47.20       47.20
  Conductor |         66       52.80      100.00
------------+-----------------------------------
      Total |        125      100.00
hist age , normal
quietly graph export hist2.svg, replace
(bin=11, start=21, width=3.4545455)

[Figure: histogram of age (Density on the y axis) with a normal density curve overlaid]

Normality test

  • H0: The data are normally distributed
  • H1: The data are not normally distributed
swilk age
                   Shapiro–Wilk W test for normal data

    Variable |        Obs       W           V         z       Prob>z
-------------+------------------------------------------------------
         age |        125    0.95915      4.069     3.151    0.00081
  • The p-value is 0.00081, which is less than 0.05. This means we reject the null hypothesis, suggesting that age is not normally distributed.
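The decision rule above can also be applied programmatically, because swilk leaves its results in r(). This is a sketch only: it uses the built-in auto data and the variable mpg as a stand-in for bus.dta and age, and the 0.05 cut-off is the conventional threshold used in the text.

```stata
* Stand-in data: built-in auto dataset instead of bus.dta
sysuse auto, clear

* Shapiro-Wilk test; W and the p-value are stored in r(W) and r(p)
swilk mpg
display "W = " %6.4f r(W) ", p = " %6.4f r(p)

* Apply the 5% decision rule to the stored p-value
if r(p) < 0.05 {
    display "Reject H0 at the 5% level"
}
else {
    display "Do not reject H0 at the 5% level"
}

* A normal quantile plot is a useful visual companion to the test
qnorm mpg
```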

Automated tables

ssc install table1, replace
ssc install summtab, replace
checking table1 consistency and verifying not already installed...
installing into C:\Users\Admin\ado\plus\...
installation complete.

checking summtab consistency and verifying not already installed...
installing into C:\Users\Admin\ado\plus\...
installation complete.

Generating new variable

gen hypertension = .
replace hypertension =1 if sbp >= 140
replace hypertension =0 if sbp < 140
(125 missing values generated)

(31 real changes made)

(94 real changes made)

What is the code doing?

  • Generate the variable hypertension and initialize it with missing values.
  • Replace hypertension with 1 if the systolic blood pressure (sbp) is greater than or equal to 140.
  • Replace hypertension with 0 if the systolic blood pressure (sbp) is less than 140.
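The three steps can be collapsed into one line. The sketch below uses a few made-up sbp values (including one missing value, which bus.dta does not contain) to show that the if !missing(sbp) qualifier keeps a missing blood pressure from being coded as hypertensive.

```stata
* Toy data: four hypothetical blood pressure readings, one missing
clear
input sbp
120
140
165
.
end

* (sbp >= 140) evaluates to 1 or 0; the if qualifier leaves
* hypertension missing wherever sbp itself is missing
generate byte hypertension = (sbp >= 140) if !missing(sbp)

tabulate hypertension, missing
```

Without the qualifier the last row would be coded 1, because Stata treats a missing value as larger than any number.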
describe hypertension
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
hypertension    float   %9.0g                 
tab hypertension
hypertensio |
          n |      Freq.     Percent        Cum.
------------+-----------------------------------
          0 |         94       75.20       75.20
          1 |         31       24.80      100.00
------------+-----------------------------------
      Total |        125      100.00

Value labels for hypertension

label define hplabel 1 "Yes" 0 "No"

label values hypertension hplabel
tab hypertension
hypertensio |
          n |      Freq.     Percent        Cum.
------------+-----------------------------------
         No |         94       75.20       75.20
        Yes |         31       24.80      100.00
------------+-----------------------------------
      Total |        125      100.00

Summarise a continuous variable across a categorical variable

bysort job: sum age , d
-> job = Driver

                             age
-------------------------------------------------------------
      Percentiles      Smallest
 1%           25             25
 5%           27             27
10%           29             27       Obs                  59
25%           33             28       Sum of wgt.          59

50%           41                      Mean           41.40678
                        Largest       Std. dev.      9.399411
75%           48             58
90%           57             58       Variance       88.34892
95%           58             59       Skewness       .1813159
99%           59             59       Kurtosis       2.056059

-------------------------------------------------------------------------------
-> job = Conductor

                             age
-------------------------------------------------------------
      Percentiles      Smallest
 1%           21             21
 5%           22             22
10%           25             22       Obs                  66
25%           29             22       Sum of wgt.          66

50%           31                      Mean           35.18182
                        Largest       Std. dev.      9.907121
75%           41             54
90%           51             54       Variance       98.15105
95%           54             57       Skewness       .7353005
99%           59             59       Kurtosis       2.477688
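When only a few statistics are needed, tabstat gives a more compact by-group summary than the detailed output above. This sketch assumes the built-in auto data as a stand-in, with mpg and foreign playing the roles of age and job.

```stata
* Stand-in data: built-in auto dataset instead of bus.dta
sysuse auto, clear

* One row per group, with only the requested statistics
tabstat mpg, by(foreign) statistics(n mean sd p50 min max) format(%9.1f)
```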

Using the table1 ado file

table1 , vars(age contn \ job cat \ht contn) format(%9.1f) saving("$Tables/BusTable1.xls",replace)
  | Factor           Level       Value       |
  |------------------------------------------|
  | N                            125         |
  |------------------------------------------|
  | age, mean (SD)               38.1 (10.1) |
  |------------------------------------------|
  | job              Driver      59 (47.2%)  |
  |                  Conductor   66 (52.8%)  |
  |------------------------------------------|
  | ht, mean (SD)                1.6 (0.1)   |
  +------------------------------------------+
file /BusTable1.xls saved
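The saved path above is printed as /BusTable1.xls, which suggests that the global macro $Tables was empty when the command ran. A minimal sketch of defining it first is shown here; the folder name tables is purely a hypothetical choice and should be adjusted to your own project layout.

```stata
* Hypothetical output folder; change to suit your project
global Tables "tables"

* Create the folder if it does not exist yet (no error if it does)
capture mkdir "$Tables"

* Check what the saving() path will expand to
display "$Tables/BusTable1.xls"
```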

Using summtab to generate table 1

summtab , by(hypertension) catvars(job) ///
    contvars(age ht wt) word wordname(table1_bus) ///
    median medfmt(0) total ///
    title("Table 1: Summary statistics by hypertension status") replace
Data Management

Example: Stepping Stones

Stepping Stones is a participatory HIV prevention programme that aims to improve sexual health by building more gender-equitable relationships. A cluster randomized trial was conducted among young rural men and women in the Eastern Cape Province of South Africa to assess the impact of Stepping Stones on HIV and HSV-2 incidence and on sexual practices. The 70 study clusters comprised 64 villages and six townships. Clusters were grouped into seven strata: one stratum comprised the townships, and the other six were villages grouped according to proximity to particular roads.

Within each stratum, equal numbers of clusters were allocated to the two study arms: an intervention arm, in which participants were given the 13 Stepping Stones sessions over a period of three months, and a control arm, in which participants were given a single 3-hour session on HIV prevention.

In each cluster about 20 men and 20 women were recruited. Those eligible were aged 16 to 24, resident in the village where they were at school, and mature enough to understand the study and the consent process; most were recruited from schools. In this study the unit of randomisation was a cluster of 20 men or a cluster of 20 women. Primary outcomes were HIV incidence and HSV-2 incidence over the study period of approximately two years.

Task

We want to join the two data sets to see which women have become HIV-infected (incident cases or sero-conversions), and to see whether there is any consistent pattern in the experience of IPV. Women in the study each have a unique study identification number (idnum). We can join the data sets together using the merge command.

Merge datasets
  • Two datasets are joined side by side using “merge 1:1”.
  • In the datasets that follow, idnum must uniquely identify the observations in each dataset.

Snapshot of baseline data

idnum visitnox intdatex hivx ipvnewx
1001 1 2003-03-19 0 0
1002 1 2003-03-19 0 0
1006 1 2003-03-19 0 0
1008 1 2003-03-19 0 0
1011 1 2003-03-19 1 0
1012 1 2003-03-19 1 1
1015 1 2003-03-19 0 1
1016 1 2003-03-19 0 0
1017 1 2003-03-19 1 0
1018 1 2003-03-19 0 1

Snapshot of followup data

idnum intdate hiv ipvnew
1001 2004-03-24 0 0
1002 2004-03-24 0 0
1006 2004-03-25 0 0
1008 2004-03-24 0 1
1011 2004-05-12 1 1
1012 2004-04-15 NA 0
1015 2004-03-24 0 0
1016 2004-04-15 0 1
1017 2004-04-15 NA 0
1018 2004-04-15 0 1

The column that uniquely identifies each woman is idnum.

use "Datasets/stonwombas.dta", clear
use "Datasets/stonwomfol.dta", clear
(Stepping Stones women baseline)

(Stepping Stones women 12 months)
Merge the datasets
  • First load the master data file into Stata with the use command.
  • Then name the dataset to join to it with the using keyword of merge.
  • To choose the name of the merge indicator variable yourself, add the generate() option.
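The steps above can be sketched on a few rows. This is an illustration under stated assumptions: the two toy datasets reuse idnum values from the snapshots, participant 1006 is deliberately left out of the follow-up file, and temporary files stand in for stonwombas.dta and stonwomfol.dta. The indicator name stonmerge matches the variable tabulated in this report.

```stata
* Toy baseline data (master)
clear
input idnum hivx ipvnewx
1001 0 0
1002 0 0
1006 0 0
1008 0 0
end
tempfile baseline
save `baseline'

* Toy follow-up data (using); 1006 is missing on purpose
clear
input idnum hiv ipvnew
1001 0 0
1002 0 0
1008 0 1
end
tempfile followup
save `followup'

* Load the master file, then merge one-to-one on idnum
use `baseline', clear
merge 1:1 idnum using `followup', generate(stonmerge)

* 1 = master only, 2 = using only, 3 = matched
tab stonmerge
```

The counts printed for these toy rows will of course differ from those of the real Stepping Stones data.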

The merge summary shows that 1,109 individuals were matched, whereas 306 were not. All 306 unmatched observations came from the master file; none came from the using file.

  • In our case this means that 1,109 women had observations in both data sets, while 306 had observations only in the baseline data set and not in the follow-up data set, because these women were lost to follow-up.
tab stonmerge
   Matching result from |
                  merge |      Freq.     Percent        Cum.
------------------------+-----------------------------------
        Master only (1) |        306       21.63       21.63
            Matched (3) |      1,109       78.37      100.00
------------------------+-----------------------------------
                  Total |      1,415      100.00

We can now use data from both datasets, e.g. we can see how many women HIV sero-converted during the twelve-month follow-up period.

tab hivx hiv , missing
       HIV |               hiv
serostatus |         0          1          . |     Total
-----------+---------------------------------+----------
         0 |       900         65        291 |     1,256 
         1 |         1        104         54 |       159 
-----------+---------------------------------+----------
     Total |       901        169        345 |     1,415 

Of the 1,256 women who tested HIV-negative at baseline, 65 sero-converted (i.e. became HIV-infected), 900 remained HIV-negative, and the remaining 291 did not have a follow-up result. Note that one woman tested HIV-positive at baseline but HIV-negative at follow-up. We can identify this woman as participant number 1870. We would need to go back to the original forms and fieldworkers to understand what happened with this participant (it is possible, for example, that a friend “replaced” her at the follow-up visit).
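One way to pick out such a discordant participant is to list the rows where the baseline result is positive and the follow-up result is negative. The sketch below uses four toy rows; only the idnum 1870 and its pattern of results are taken from the text, and the layout of the merged data is assumed.

```stata
* Toy merged data: baseline (hivx) and follow-up (hiv) results
clear
input idnum hivx hiv
1001 0 0
1011 1 1
1012 1 .
1870 1 0
end

* Positive at baseline but negative at follow-up;
* hiv == 0 is false for missing follow-up results, so 1012 is excluded
list idnum hivx hiv if hivx == 1 & hiv == 0
```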

  • We can compare the proportion of women experiencing IPV (Intimate Partner Violence) at follow-up, according to whether or not they had experienced IPV at baseline.
tab ipvnewx ipvnew , row
| Key            |
|----------------|
|   frequency    |
| row percentage |
+----------------+

           |        ipvnew
   ipvnewx |         0          1 |     Total
-----------+----------------------+----------
         0 |       617        137 |       754 
           |     81.83      18.17 |    100.00 
-----------+----------------------+----------
         1 |       201        154 |       355 
           |     56.62      43.38 |    100.00 
-----------+----------------------+----------
     Total |       818        291 |     1,109 
           |     73.76      26.24 |    100.00 

Amongst women who had not experienced IPV at baseline, 18.2% experienced IPV at follow-up, while amongst women who had experienced IPV at baseline, 43.4% experienced IPV at follow-up.

Finally, using the visit dates, we can look at the distribution of follow-up days between the two visits, which should be roughly 365 days since follow-up was at 12 months.
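A minimal sketch of that calculation follows. It assumes the visit dates are (or can be converted to) Stata daily dates; here three rows from the snapshots are typed in as strings and converted with date(). The variable name fudays is hypothetical, and the cut-off of 30 days for flagging suspicious values is only an example.

```stata
* Toy data: baseline and follow-up dates entered as strings
clear
input idnum str10 bas str10 fol
1001 "2003-03-19" "2004-03-24"
1006 "2003-03-19" "2004-03-25"
1011 "2003-03-19" "2004-05-12"
end

* Convert to Stata daily dates and give them a readable format
gen intdatex = date(bas, "YMD")
gen intdate  = date(fol, "YMD")
format intdatex intdate %td

* Days between the two visits
gen fudays = intdate - intdatex
summarize fudays, detail

* Flag implausibly short (or negative) follow-up for checking
list idnum intdatex intdate fudays if fudays < 30
```

These three rows are only for illustration; the summary figures quoted in the text come from the full data set.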

Median follow-up time was 386 days (slightly more than the expected 365 days) and the mean was 409 days, because some participants were only traced after about 2 years. There are two strange values: one with negative follow-up days (meaning that the follow-up visit was recorded as having taken place before the baseline visit) and one with only 4 days of follow-up between visits. We can identify the participants (idnum 1423 and idnum 1666), but we would need to look at the original fieldwork records to resolve the query.

Many to one merging

In the example above we joined two data sets using a one-to-one merge, since each data set had only one observation per participant.

In some cases there will be many observations per participant in one data set and only one observation per participant in the other.

For example, in longitudinal studies all of the follow-up observations are often put into the same data set (which thus has many observations per participant), while the other data set contains baseline and design information (and thus one observation per participant).

Example: COSTOP was a randomized controlled trial carried out to investigate whether it is safe for HIV-infected patients stabilized on ART (on ART for at least six months, on CTX prophylaxis and with a CD4 count above 250 cells/µl) to stop taking CTX prophylaxis. A total of 2,180 patients were individually randomized either to continue taking CTX or to take an equivalent placebo (i.e. to stop CTX prophylaxis).

One secondary objective was to compare neutrophil counts over time between the two treatment arms, since CTX has some haematological toxicity. Data on neutrophil counts are given in costop_neutrophil, while baseline data are given in costop_base.

use "Datasets/costop_base.dta" , clear
sex ageyrs site whostbas cdstrat idnum
1 46 1 2 1 1001
2 41 1 2 1 1002
2 47 1 2 1 1003
1 42 1 3 1 1004
2 38 1 3 1 1005
2 51 1 2 1 1006
2 48 1 4 1 1007
1 48 1 2 1 1008
2 35 1 2 1 1009
2 23 1 2 1 1010
desc
Contains data from Datasets/costop_base.dta
 Observations:         2,180                  
    Variables:             6                  23 Jan 2023 12:34
-------------------------------------------------------------------------------
Variable      Storage   Display    Value
    name         type    format    label      Variable label
-------------------------------------------------------------------------------
sex             byte    %8.0g      sexlab     Gender
ageyrs          byte    %9.0g                 
site            byte    %8.0g      sitelab    study Site
whostbas        byte    %8.0g                 Baseline WHO stage
cdstrat         byte    %8.0g      cdlab      CD4 stratum at baseline
idnum           float   %9.0g                 
-------------------------------------------------------------------------------
Sorted by: idnum
tab1 sex site cdstrat
-> tabulation of sex  

     Gender |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |        569       26.10       26.10
     Female |      1,611       73.90      100.00
------------+-----------------------------------
      Total |      2,180      100.00

-> tabulation of site  

 study Site |      Freq.     Percent        Cum.
------------+-----------------------------------
    Entebbe |      1,002       45.96       45.96
     Masaka |      1,178       54.04      100.00
------------+-----------------------------------
      Total |      2,180      100.00

-> tabulation of cdstrat  

CD4 stratum |
at baseline |      Freq.     Percent        Cum.
------------+-----------------------------------
    251-499 |      1,142       52.39       52.39
       500+ |      1,038       47.61      100.00
------------+-----------------------------------
      Total |      2,180      100.00
use "Datasets/costop_neutrophil.dta" , clear
ne_abs idnum months
1.55 1001 2.661191
1.26 1001 5.519507
0.94 1001 8.279261
1.22 1001 10.940452
1.83 1001 13.798768
1.21 1001 16.558521
1.20 1001 19.318275
1.34 1001 22.078030
1.36 1001 24.837782
1.37 1001 27.597536
skimr::skim(costop)
Data summary
Name costop
Number of rows 23181
Number of columns 3
_______________________
Column type frequency:
numeric 3
________________________
Group variables None

Variable type: numeric

skim_variable n_missing complete_rate mean sd p0 p25 p50 p75 p100 hist
ne_abs 88 1 1.81 1.21 0.00 1.15 1.59 2.17 50.00 ▇▁▁▁▁
idnum 0 1 2789.91 1149.98 1001.00 1553.00 3058.00 4028.00 4601.00 ▇▆▃▅▇
months 0 1 16.00 9.55 -6.51 8.28 14.72 22.74 39.75 ▁▇▆▆▂

Data set costop_neutrophil has 23,181 observations (23,093 of them with a non-missing neutrophil count), since participants could have a number of hematology tests during the trial (including neutrophil count).

We can now merge the neutrophil data to the baseline data to get the associated characteristics (e.g. sex, age, study site, CD4 stratum, WHO stage) corresponding to the neutrophil counts.

Many observations in the neutrophil data (from a single participant) will be merged to a single observation in the baseline data, so this is known as many-to-one or m:1 merging.

It is also sometimes called “table lookup”, since data for a given participant are “looked up” in the baseline table.
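The idea of a many-to-one merge can be sketched on toy data. The rows for participant 1001 are copied from the snapshots above; the neutrophil row for participant 1002 is made up, and temporary files stand in for the real .dta files.

```stata
* Toy baseline data: one row per participant
clear
input idnum sex site
1001 1 1
1002 2 1
end
tempfile base
save `base'

* Toy neutrophil data: several rows per participant (master)
clear
input idnum ne_abs months
1001 1.55 2.66
1001 1.26 5.52
1002 1.80 2.70
end

* m:1 because idnum repeats in the master but is unique in the using file
merge m:1 idnum using `base'
tab _merge
list, sepby(idnum)
```

Each neutrophil row now carries the sex and site of its participant, which is what makes summaries by baseline characteristics possible.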

Summarising the merged data by site confirms that on average neutrophil counts are slightly higher in Entebbe than in Masaka. Note that we save the merged data set as “costop_ndm.dta” to use in the next three sections.

In the neutrophil data set, for each participant we might want to select the first neutrophil count (to get an estimate at baseline or enrolment) and also the last neutrophil count (to get an estimate at the end of the trial). We will see how to do this below:

table(first_obs$sex)

   1    2 
 569 1611 
tab1 sex site cdstrat if ne_abs<.
-> tabulation of sex if ne_abs<. 

     Gender |      Freq.     Percent        Cum.
------------+-----------------------------------
       Male |        555       25.81       25.81
     Female |      1,595       74.19      100.00
------------+-----------------------------------
      Total |      2,150      100.00

-> tabulation of site if ne_abs<. 

 study Site |      Freq.     Percent        Cum.
------------+-----------------------------------
    Entebbe |        991       46.09       46.09
     Masaka |      1,159       53.91      100.00
------------+-----------------------------------
      Total |      2,150      100.00

-> tabulation of cdstrat if ne_abs<. 

CD4 stratum |
at baseline |      Freq.     Percent        Cum.
------------+-----------------------------------
    251-499 |      1,128       52.47       52.47
       500+ |      1,022       47.53      100.00
------------+-----------------------------------
      Total |      2,150      100.00
Missing data
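  • var > 60 is true if the variable is greater than 60 or missing, because Stata treats missing values as larger than any number
  • to exclude missing data, also require that the value is less than . (for example, var > 60 & var < .)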
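A three-row toy example makes the difference visible. The variable name age and its values are hypothetical.

```stata
* Toy data: two observed ages and one missing value
clear
input age
45
72
.
end

* Counts 2: the missing value is treated as larger than 60
count if age > 60

* Counts 1: the extra condition excludes the missing value
count if age > 60 & age < .
```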
summary(first_obs$ne_abs)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
  0.030   1.170   1.610   1.811   2.180  23.670      30 
summary(first_obs$months)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
 -6.505   2.727   2.760   2.317   2.760  11.039      23 
summa ne_abs , det
                       Neutrophils Abs
-------------------------------------------------------------
      Percentiles      Smallest
 1%          .47            .03
 5%          .73            .06
10%          .87             .1       Obs               2,150
25%         1.17            .23       Sum of wgt.       2,150

50%         1.61                      Mean           1.812577
                        Largest       Std. dev.      1.079763
75%         2.18           7.38
90%         2.92           7.92       Variance       1.165888
95%          3.5           10.1       Skewness       5.411018
99%          5.8          23.67       Kurtosis       85.58683
summa month , det
                           months
-------------------------------------------------------------
      Percentiles      Smallest
 1%    -.6899384      -6.505134
 5%    -.4928131       -3.74538
10%    -.4599589       -3.74538       Obs               2,157
25%     2.726899      -3.055442       Sum of wgt.       2,157

50%     2.759754                      Mean           2.316625
                        Largest       Std. dev.      1.361104
75%     2.759754       9.297741
90%     2.858316       9.626284       Variance       1.852604
95%     3.022587       11.03901       Skewness      -.5982735
99%     5.552361       11.03901       Kurtosis       7.534941

Since we are looking within idnum and have sorted by month within idnum, _n measures the number of the observation within each participant, so _n=1 denotes the first (earliest) hematology test, _n=2 the second test, and so on. Only 2,156 participants have at least one neutrophil result. Note that for over 10% the first result was obtained before enrolment, i.e. during the screening phase of the trial, as shown by negative values for month. We can now see how to find the last neutrophil count.
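The use of _n within idnum can be sketched as follows. The rows are toy values (some copied from the neutrophil snapshot, others made up), and the variable names visit and nvisits are hypothetical.

```stata
* Toy long data, deliberately not in time order
clear
input idnum ne_abs months
1001 1.26 5.52
1001 1.55 2.66
1002 1.90 25.00
1002 2.10 -0.46
end

* Sort by months within idnum, then number the visits
bysort idnum (months): gen visit = _n
bysort idnum (months): gen nvisits = _N
list, sepby(idnum)

* Keep only the first (earliest) test for each participant
keep if visit == 1
list idnum ne_abs months
```

Putting months in parentheses after bysort sorts on it without making it part of the grouping.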

The last observation within each idnum (participant) is indexed by _N. Note that the median and mean neutrophil counts are very similar at the beginning and end of the trial. We now have two data sets, one containing the first neutrophil count and the other containing the last neutrophil count. We could merge these data sets (1:1) and hence find, for each participant, how much the neutrophil count has changed over the course of the trial.
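As an alternative to saving two data sets and merging them, the first and last values can be picked out with subscripts in a single pass. This is a sketch of that alternative rather than the approach described above; the data are toy values and the names first_ne, last_ne and change are hypothetical.

```stata
* Toy long data
clear
input idnum ne_abs months
1001 1.55 2.66
1001 1.26 5.52
1001 1.37 27.60
1002 2.10 2.76
1002 1.90 25.00
end

* Within each participant, [1] is the earliest and [_N] the latest test
bysort idnum (months): gen first_ne = ne_abs[1]
bysort idnum (months): gen last_ne  = ne_abs[_N]

* Reduce to one row per participant and compute the change
bysort idnum (months): keep if _n == _N
gen change = last_ne - first_ne
list idnum first_ne last_ne change
```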

Long and wide format data

The neutrophil data set is an example of a “long” data set, since we have a separate row of data for each visit. An alternative would be a “wide” data set, in which we have one row of data for each participant and within this row the first neutrophil count is recorded as ne_abs1, the second as ne_abs2, the third as ne_abs3, and so on. We see how to do this below:
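A minimal sketch of the long-to-wide conversion is given here on toy rows (the values for participant 1002 are made up). The visit counter visitnum is created first because reshape wide needs a variable that numbers the observations within each participant.

```stata
* Toy long data: one row per visit
clear
input idnum ne_abs months
1001 1.55 2.66
1001 1.26 5.52
1001 0.94 8.28
1002 2.10 2.76
1002 1.90 5.50
end

* Number the visits within each participant
bysort idnum (months): gen visitnum = _n

* One row per idnum; ne_abs and months get the visit number as a suffix
reshape wide ne_abs months, i(idnum) j(visitnum)
list
```

Participant 1002 has only two visits in this toy example, so ne_abs3 and months3 are missing in that row.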

The first participant is a 46-year-old male from Entebbe with 13 neutrophil results, while the second participant is a 41-year-old female from Entebbe with 15 neutrophil results.

Note that for certain applications a wide data set is preferable, while for others a long data set is preferable. We can convert a wide data set to a long data set, provided that the variables to be reshaped end in a digit denoting the serial number (so here ne_abs1, ne_abs2, etc.).

reshape long ne_abs months , i(idnum) j(visitnum)
(j = 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
>  29 30)

Data                               Wide   ->   Long
-----------------------------------------------------------------------------
Number of observations            2,156   ->   64,680      
Number of variables                  67   ->   10          
j variable (30 values)                    ->   visitnum
xij variables:
           ne_abs1 ne_abs2 ... ne_abs30   ->   ne_abs
           months1 months2 ... months30   ->   months
-----------------------------------------------------------------------------
collapse to summarise data

The long data set above is an example of a clustered data structure, with repeated measures of neutrophil counts clustered within participants. We often want to summarize the data at the cluster level, here to get participant-level summaries (number of visits and mean neutrophil count). This can be achieved using the collapse command as shown below:

use costop_ndm , clear
drop if ne_abs==.
(111 observations deleted)