In this eighteenth article in the R series, we take a look at the various data sets available in R.
The data sets available in R cover a wide range of fields. They include the performance of automobiles, the approval ratings of the presidents of the United States, the magnitudes of earthquakes near Fiji, the number of international airline passengers over a certain period, and much more.
mtcars
We have been using the mtcars data set, which contains information on the fuel consumption, design and performance of automobiles, extracted from the 1974 Motor Trend US magazine. The data frame has 32 entries and 11 numeric fields, as shown below:
> head(mtcars)
                   mpg cyl disp  hp drat    wt  qsec vs am gear carb
Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4
Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4
Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1
Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1
Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2
Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1
The various fields are described below:
Field | Description |
mpg | Miles per (US) gallon |
cyl | Number of cylinders |
disp | Displacement (cubic inches) |
hp | Gross horsepower |
drat | Rear axle ratio |
wt | Weight (1000 lbs) |
qsec | Quarter-mile time (seconds) |
vs | Engine (0=V-shaped, 1=straight) |
am | Transmission (0=automatic, 1=manual) |
gear | Number of gears |
carb | Number of carburettors |
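Since vs and am are stored as 0/1 codes, it can help to label them before summarising. The following is a minimal sketch of one way to do this; the names cars and mpg_by_group are illustrative and not part of the data set:

```r
# Work on a copy so the built-in mtcars data frame is left untouched
cars <- mtcars

# Recode the 0/1 fields using the labels from the table above
cars$vs <- factor(cars$vs, levels = c(0, 1), labels = c("V-shaped", "straight"))
cars$am <- factor(cars$am, levels = c(0, 1), labels = c("automatic", "manual"))

# Average miles per gallon for each combination of cylinders and transmission
mpg_by_group <- aggregate(mpg ~ cyl + am, data = cars, FUN = mean)
print(mpg_by_group)
```

The result has one row per combination of cyl and am that occurs in the data.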
airquality
The airquality data set provides daily air quality measurements in New York between May and September 1973. The data was obtained from the New York State Department of Conservation (ozone data) and the National Weather Service (meteorological data). The data frame has 153 observations and six variables, as follows:
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6
The description of the numeric fields is as follows:
Field | Description |
Ozone | Ozone in parts per billion |
Solar.R | Solar radiation in Langleys |
Wind | Wind speed in miles per hour |
Temp | Temperature in degrees Fahrenheit |
Month | Values: 1-12 |
Day | Values: 1-31 |
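The NA entries in the output above are missing readings, and they need to be handled before most calculations. As a rough sketch (the variable names are our own), the missing values can be counted and skipped as follows:

```r
# Count the missing values (NA) in each column
na_counts <- colSums(is.na(airquality))
print(na_counts)

# Keep only the rows in which every field was recorded
complete_days <- airquality[complete.cases(airquality), ]

# Mean ozone level per month; the formula interface of aggregate()
# drops the rows with a missing Ozone value by default
ozone_by_month <- aggregate(Ozone ~ Month, data = airquality, FUN = mean)
print(ozone_by_month)
```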
AirPassengers
An example of time series data is the AirPassengers data set from Box & Jenkins. It contains the monthly totals of international airline passengers (in thousands) from 1949 to 1960. The first three years of the data set are shown below for reference:
> AirPassengers
     Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
1949 112 118 132 129 121 135 148 148 136 119 104 118
1950 115 126 141 135 125 149 170 170 158 133 114 140
1951 145 150 178 163 172 178 199 199 184 162 146 166
...
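Because AirPassengers is a time series object rather than a data frame, it is queried with the time series functions. A short sketch, with variable names chosen only for illustration:

```r
# Inspect the time series attributes
print(class(AirPassengers))       # object class
print(start(AirPassengers))       # first year and month
print(end(AirPassengers))         # last year and month
print(frequency(AirPassengers))   # observations per year

# Extract the twelve monthly values for 1949
first_year <- window(AirPassengers, start = c(1949, 1), end = c(1949, 12))

# Total passengers (in thousands) for each calendar year
yearly_totals <- aggregate(AirPassengers, FUN = sum)
print(yearly_totals)
```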
presidents
Another example of time series data is the presidents data set, which holds the quarterly approval ratings of the President of the United States from 1945 to 1974. It has 120 values, some of which are missing (NA), and was provided by the Gallup Organisation.
> presidents
     Qtr1 Qtr2 Qtr3 Qtr4
1945   NA   87   82   75
1946   63   50   43   32
1947   35   60   54   55
1948   36   39   NA   NA
1949   69   57   57   51
1950   45   37   46   39
...
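Since some quarters have no rating, summary functions need to be told to skip the NA values. The sketch below (the variable names are hypothetical) counts the missing quarters and selects a single year:

```r
# Number of quarters with no recorded rating
missing_quarters <- sum(is.na(presidents))
print(missing_quarters)

# Mean approval rating over the quarters that have a value
mean_rating <- mean(presidents, na.rm = TRUE)
print(mean_rating)

# Ratings for 1946, selected by year and quarter
ratings_1946 <- window(presidents, start = c(1946, 1), end = c(1946, 4))
print(ratings_1946)
```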
Titanic
The Titanic data set summarises the fate of the passengers and crew of the Titanic in a four-dimensional array, cross-classified by economic status (class), sex, age and survival. The complete table is given below:
> Titanic
, , Age = Child, Survived = No

      Sex
Class  Male Female
  1st     0      0
  2nd     0      0
  3rd    35     17
  Crew    0      0

, , Age = Adult, Survived = No

      Sex
Class  Male Female
  1st   118      4
  2nd   154     13
  3rd   387     89
  Crew  670      3

, , Age = Child, Survived = Yes

      Sex
Class  Male Female
  1st     5      1
  2nd    11     13
  3rd    13     14
  Crew    0      0

, , Age = Adult, Survived = Yes

      Sex
Class  Male Female
  1st    57    140
  2nd    14     80
  3rd    75     76
  Crew  192     20
The variables and their values are described below:
Name | Values |
Class | 1st / 2nd / 3rd / Crew |
Sex | Male / Female |
Age | Child / Adult |
Survived | Yes / No |
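A four-dimensional array is hard to read at a glance, so it is often collapsed over some of its dimensions. The following sketch sums over Sex and Age to leave a Class by Survived table; the names class_survival and titanic_df are our own:

```r
# Collapse the four-dimensional array to Class x Survived counts
# (dimension 1 is Class, dimension 4 is Survived)
class_survival <- margin.table(Titanic, margin = c(1, 4))
print(class_survival)

# Proportion of survivors within each class (each row sums to 1)
print(round(prop.table(class_survival, margin = 1), 3))

# Convert to a data frame with one row per combination and a Freq column
titanic_df <- as.data.frame(Titanic)
print(head(titanic_df))
```

The data frame form is convenient when the counts need to be passed to functions that expect columns rather than array dimensions.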
quakes
The quakes data set provides the locations of 1000 seismic events of magnitude MB > 4.0 that have occurred near Fiji since 1964. A sample output from the data set is given below for reference:
> head(quakes)
     lat   long depth mag stations
1 -20.42 181.62   562 4.8       41
2 -20.62 181.03   650 4.2       15
3 -26.00 184.10    42 5.4       43
4 -17.97 181.66   626 4.1       19
5 -20.42 181.96   649 4.0       11
6 -19.68 184.31   195 4.0       12
The numeric variables and their descriptions are as follows:
Name | Description |
lat | Latitude |
long | Longitude |
depth | Depth in km |
mag | Richter magnitude |
stations | Number of stations that reported activity |
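As a quick illustration of working with these columns, the sketch below summarises the magnitudes and picks out the stronger events. The 5.5 threshold and the 0.5-wide bands are arbitrary choices made for this example:

```r
# Summary statistics of the recorded magnitudes
print(summary(quakes$mag))

# Events of magnitude 5.5 or more (the threshold is an arbitrary choice)
strong <- subset(quakes, mag >= 5.5)
print(nrow(strong))

# Number of events in each 0.5-wide magnitude band
mag_bands <- table(cut(quakes$mag,
                       breaks = seq(4, 6.5, by = 0.5),
                       include.lowest = TRUE))
print(mag_bands)
```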
melanoma
The melanoma data set, which is part of the boot package, contains measurements on patients with malignant melanoma whose tumours were completely removed by surgery between 1962 and 1977. The patients were followed up until the end of 1977. A sample of the data set is as follows:
> library(boot)
> head(melanoma)
  time status sex age year thickness ulcer
1   10      3   1  76 1972      6.76     1
2   30      3   1  56 1968      0.65     0
3   35      2   1  41 1977      1.34     0
4   99      3   0  71 1968      2.90     0
5  185      1   1  52 1965     12.08     1
6  204      1   1  28 1971      4.84     1
The data frame consists of the following columns:
Name | Description |
time | Survival time in days since the operation |
status | 1=Died from melanoma, 2=Alive, 3=Died from other causes |
sex | 1=Male, 0=Female |
age | Age in years at the time of the operation |
year | Year of operation |
thickness | Tumour thickness (mm) |
ulcer | 1=present, 0=absent |
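The coded fields are easier to tabulate once they carry readable labels. Here is a minimal sketch, assuming the boot package is installed; the copy named mel is our own:

```r
# melanoma ships with the boot package
library(boot)

# Work on a copy and label the coded fields using the table above
mel <- melanoma
mel$status <- factor(mel$status, levels = 1:3,
                     labels = c("died from melanoma", "alive",
                                "died from other causes"))
mel$sex <- factor(mel$sex, levels = c(0, 1), labels = c("female", "male"))

# Cross-tabulate outcome against sex
print(table(mel$status, mel$sex))

# Median tumour thickness (mm) without (0) and with (1) ulceration
print(tapply(mel$thickness, mel$ulcer, median))
```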
nitrofen
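The nitrofen data set has 50 rows and five columns. Nitrofen is a herbicide that was used to control weeds in cereals and rice. Although it is relatively non-toxic to adult mammals, it is no longer in commercial use in the US. The data comes from an experiment that measured the effect of nitrofen on the reproduction of a species of zooplankton (Ceriodaphnia dubia). The first few rows of the data frame are given below:
> library(boot)
> head(nitrofen)
  conc brood1 brood2 brood3 total
1    0      3     14     10    27
2    0      5     12     15    32
3    0      6     11     17    34
4    0      6     12     15    33
5    0      6     15     15    36
6    0      5     14     15    34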
The five fields are described below in detail:
Name | Description |
conc | The nitrofen concentration (µg/litre) |
brood1 | Number of live offspring in the first brood |
brood2 | Number of live offspring in the second brood |
brood3 | Number of live offspring in the third brood |
total | Total number of live offspring in the first three broods |
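The relationship between the columns can be checked directly, and the totals can be compared across concentration levels. This is only a sketch, assuming the boot package is installed; totals_match and mean_by_conc are illustrative names:

```r
library(boot)

# Check whether total equals the sum of the three brood counts in every row
totals_match <- with(nitrofen, all(total == brood1 + brood2 + brood3))
print(totals_match)

# Mean total offspring at each concentration level
mean_by_conc <- aggregate(total ~ conc, data = nitrofen, FUN = mean)
print(mean_by_conc)
```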
nuclear
The nuclear data set has information on the construction of 32 light water reactor (LWR) plants in the US in the late 1960s and early 1970s. The data was collected in order to predict the construction cost of further plants of this kind. The first few entries from the data set are given below:
> library(boot)
> head(nuclear)
    cost  date t1 t2  cap pr ne ct bw cum.n pt
1 460.05 68.58 14 46  687  0  1  0  0    14  0
2 452.99 67.33 10 73 1065  0  0  1  0     1  0
3 443.22 67.33 10 85 1065  1  0  1  0     1  0
4 652.32 68.00 11 67 1065  0  1  1  0    12  0
5 642.23 68.00 11 78 1065  1  1  1  0    12  0
6 345.39 67.92 13 51  514  0  1  1  0     3  0
The description of the various numeric columns is as follows:
Name | Description |
cost | Cost of construction (millions of dollars) |
date | Date on which construction permit was issued |
t1 | Time between application and construction permits |
t2 | Time between issue of operating licence and construction permit |
cap | Capacity of power plant (MWe) |
pr | 1=Prior existence of LWR plant on site, 0=None |
ne | 1=Plant in north-east US, 0=Otherwise |
ct | 1=Cooling tower in plant, 0=None |
bw | 1=Nuclear steam supply system by Babcock-Wilcox, 0=Otherwise |
cum.n | Cumulative number of power plants constructed by architect-engineer |
pt | 1=Plant with partial turnkey guarantees, 0=Otherwise |
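To give a feel for how such data might be used for cost prediction, the sketch below fits a deliberately simple linear model of cost against capacity. This is our own illustrative assumption, not the model used in the original study, and the 1000 MWe plant is hypothetical:

```r
library(boot)

# Simple linear model of construction cost against plant capacity
# (an illustrative choice; a realistic model would use more predictors)
fit <- lm(cost ~ cap, data = nuclear)
print(coef(fit))

# Predicted cost for a hypothetical 1000 MWe plant
new_plant <- data.frame(cap = 1000)
predicted_cost <- predict(fit, newdata = new_plant)
print(predicted_cost)
```

Calling summary(fit) shows how much of the variation in cost this single predictor explains, which is a useful check before adding further columns to the model.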
You are encouraged to explore the various other R data sets available at https://stat.ethz.ch/R-manual/R-devel/library/datasets/html/00Index.html.
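The same list can also be browsed from within an R session. A small sketch, in which the variable name available is our own:

```r
# Names and titles of the data sets in the base 'datasets' package
available <- data(package = "datasets")$results[, c("Item", "Title")]
print(head(available))

# Load a data set from an installed package without attaching the package
data(melanoma, package = "boot")
print(dim(melanoma))

# In an interactive session, the help page of a data set can be opened with:
# ?mtcars
```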