Summarizing the data into a data frame

To get a summary of the data, we can execute summary(data) and see the relevant summary for each variable. The summary is tailored to each column's data type. As you can see, numerical variables such as ID and NVotes get a quantile summary, while character variables such as RegionName get a length/class/mode description, and factor (categorical) variables such as AreaType get a count for each category. If a factor has many categories, the summary shows the ones that appear the most and groups the rest under (Other).

summary(data)
#>        ID         RegionName            NVotes          Leave
#>  Min.   :   1   Length:1070        Min.   : 1039   Min.   : 287
#>  1st Qu.: 268   Class :character   1st Qu.: 4252   1st Qu.:1698
#>  Median : 536   Mode  :character   Median : 5746   Median :2874
#>  Mean   : 536                      Mean   : 5703   Mean   :2971
#>  3rd Qu.: 803                      3rd Qu.: 7020   3rd Qu.:3936
#>  Max.   :1070                      Max.   :15148   Max.   :8316
(Truncated output)

From here, we can see that London is the region to which the most wards belong, followed by the North West and the West Midlands. We can also see that the ward with the fewest votes in all of the data had only 1,039 votes, the one with the most votes had 15,148, and the mean number of votes per ward was 5,703. We will take a deeper look into these kinds of analyses later in the chapter. For now, we'll focus on making this summary data useful for further analysis. As you may have noticed, we can't use the summary() results to make computations. We can try to save the summary into a variable, find out the variable's type, and traverse it in an appropriate way. However, if we do that, we will find that it's text data, which means that we can't use it for computations as it is:

summary <- summary(data) 
class(summary)
#> [1] "table"
summary[1]
#> [1] "Min. : 1 "
class(summary[1])
#> [1] "character"

Surely, there must be a way to get the summary data into a data frame for further analysis. This is R, so you can bet there is! The first thing we should note is that we can't directly translate the output of the summary() function into a data frame because of the non-numerical variables. These non-numerical variables produce a different summary structure, one that is not composed of the minimum, first quartile, median, mean, third quartile, and maximum values. This means that we first need to subset the data to get only the numerical variables. After all, a data frame is a rectangular structure with well-defined rows and columns; if we tried to mix numerical and non-numerical summaries in a single data frame, we would have a hard time doing so.
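To see concretely why these structures can't be mixed, consider a minimal sketch with hypothetical toy vectors: summary() returns a six-element quantile summary for numeric input, but a three-element length/class/mode description for character input, so the two don't line up as rows of a single rectangular structure:

```r
# summary() of a numeric vector: six named values (hypothetical toy data)
numeric_summary <- summary(c(1, 2, 3, 4))
names(numeric_summary)
#> [1] "Min."    "1st Qu." "Median"  "Mean"    "3rd Qu." "Max."

# summary() of a character vector: a different, three-element structure
character_summary <- summary(c("a", "b", "b"))
names(character_summary)
#> [1] "Length" "Class"  "Mode"
```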

To find if a column is numeric or not, we can use the is.numeric() function. For example, we can see that the Proportion column is numeric and the RegionName is not:

is.numeric(data$Proportion)
#> [1] TRUE
is.numeric(data$RegionName)
#> [1] FALSE

We can then apply is.numeric() to each column by using the sapply() function. This will give us a logical (Boolean) vector with a TRUE or FALSE value for each column, indicating whether or not it's numeric. Then we can use this logical vector to subset our data and get only the numerical columns with data[, numerical_variables]. As you can see, there are no non-numerical columns in the data_numerical object:

numerical_variables <- sapply(data, is.numeric) 
numerical_variables
#> ID RegionName NVotes Leave Residents
#> TRUE FALSE TRUE TRUE TRUE
#> Households MeanAge AdultMeanAge Age_0to4 Age_5to7
#> TRUE TRUE TRUE TRUE TRUE
(Truncated output)

data_numerical <- data[, numerical_variables]
colnames(data_numerical)
#> [1] "ID" "NVotes" "Leave" "Residents"
#> [5] "Households" "MeanAge" "AdultMeanAge" "Age_0to4"
#> [9] "Age_5to7" "Age_8to9" "Age_10to14" "Age_15"
#> [13] "Age_16to17" "Age_18to19" "Age_20to24" "Age_25to29"
(Truncated output)

Since it doesn't make much sense to get summary values for the ID variable, we can remove it from the logical vector, effectively treating it as a non-numerical variable. If we do, we must remember to recreate the data_numerical object to make sure it doesn't include the ID variable either:

numerical_variables[["ID"]] <- FALSE 
data_numerical <- data[, numerical_variables]

To create our numerical variables summary, we will first apply the summary() function we used before to each numerical column using the lapply() function. The lapply() function returns a named list, where each list member has the corresponding column name:

lapply(data[, numerical_variables], summary)
#> $NVotes
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1039 4252 5746 5703 7020 15148
#>
#> $Leave
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 287 1698 2874 2971 3936 8316
#>
#> $Residents
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> 1932 8288 11876 11646 14144 34098
#>
(Truncated output)

Now we need to put each member of this list together into a data frame. To do so, we will use the cbind() and do.call() functions. do.call() calls cbind() a single time, passing it every member of the list generated by lapply() as a separate argument, which binds them all together as columns. To get a reminder on how these operations work, take a look at Chapter 1, Introduction to R:

numerical_summary <- do.call(cbind, lapply(data_numerical, summary))
#> Warning in (function (..., deparse.level = 1) : number of rows of result is
#> not a multiple of vector length (arg 1)

numerical_summary
#> NVotes Leave Residents Households MeanAge AdultMeanAge Age_0to4
#> Min. 1039 287 1932 779 27.80 29.20 2.200
#> 1st Qu. 4252 1698 8288 3466 35.60 44.10 5.400
#> Median 5746 2874 11876 4938 38.70 47.40 6.300
#> Mean 5703 2971 11646 4767 38.45 46.85 6.481
#> 3rd Qu. 7020 3936 14144 5832 41.40 49.90 7.500
#> Max. 15148 8316 34098 15726 51.60 58.10 12.300
#> NA's 1039 267 1932 779 27.80 29.20 2.200

We got our results, but not so fast! We got a warning, and it looks suspicious. What does this number of rows of result is not a multiple of vector length message mean? Aha! If we take a more detailed look at the list we previously got from our lapply() call, we can see that in the case of Leave (and Proportion) we get an extra NA's entry that we don't get for any other variable. That means that when we try to use cbind() on these summaries, the extra NA's entry creates an extra space that needs to be filled. This is a problem we looked at in Chapter 1, Introduction to R.

As we saw then, R deals with it by recycling the shorter vectors in order until all spaces are filled. In our case, this means that the first element, the one corresponding to the minimum value, will be repeated in the NA's row for each column that doesn't have an NA count of its own. You can verify this by comparing the numbers in the Min. and NA's rows for variables other than Leave or Proportion (for these two, the values should actually be different).
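This recycling behavior is easy to reproduce in isolation; a minimal sketch with hypothetical vectors of lengths 3 and 4 shows the same warning and the repeated first element:

```r
# cbind() recycles the shorter vector from its first element and warns
# when the lengths don't divide evenly (hypothetical toy vectors)
short <- c(1, 2, 3)
long  <- c(10, 20, 30, 40)
cbind(short, long)
#> Warning: number of rows of result is not a multiple of vector length (arg 1)
#>      short long
#> [1,]     1   10
#> [2,]     2   20
#> [3,]     3   30
#> [4,]     1   40
```

Note how the fourth row of short is 1, the recycled first element, just like the Min. values repeated in our summary's NA's row.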

To fix it, we could just remove the extra NA's row from the resulting data frame, but that would treat the symptom rather than the warning's source. To deal with the source, we need each variable's summary to have the same number of elements before we apply cbind(). Since we already know that we have 267 missing values for the Leave variable, which in turn affects the Proportion variable, we can fix this by keeping only the complete cases, meaning the observations that don't have an NA value in any of their variables; or, put another way, by dropping every observation that contains at least one NA. Once we do that, we get our results back without any warnings:

data <- data[complete.cases(data), ] 
data_numerical <- data[, numerical_variables]
numerical_summary <- do.call(cbind, lapply(data_numerical, summary))
numerical_summary
#> NVotes Leave Residents Households MeanAge AdultMeanAge Age_0to4
#> Min. 1039 287 1932 779 28.40 30.50 2.200
#> 1st Qu. 4242 1698 8405 3486 35.70 44.10 5.400
#> Median 5739 2874 11911 4935 38.60 47.40 6.300
#> Mean 5725 2971 11739 4793 38.43 46.83 6.479
#> 3rd Qu. 7030 3936 14200 5850 41.40 49.90 7.500
#> Max. 15148 8316 34098 15726 47.30 56.10 12.300
(Truncated output)
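The behavior of complete.cases() itself can be sketched on a small hypothetical data frame: it returns TRUE only for rows with no NA values in any column, so subsetting with it drops every row containing at least one NA:

```r
# complete.cases() flags rows with no NA in any column (hypothetical data)
df <- data.frame(x = c(1, 2, NA), y = c("a", NA, "c"))
complete.cases(df)
#> [1]  TRUE FALSE FALSE
df[complete.cases(df), ]
#>   x y
#> 1 1 a
```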

If we want to get the summary values as columns and the variables as rows, we can use the rbind() function instead of cbind(). The structure we actually end up using will depend on what we want to do with it. However, we can easily change between them later if we need to:

do.call(rbind, lapply(data_numerical, summary))
#> Min. 1st Qu. Median Mean 3rd Qu. Max.
#> NVotes 1039.0000 4241.5000 5.739e+03 5.725e+03 7.030e+03 1.515e+04
#> Leave 287.0000 1697.5000 2.874e+03 2.971e+03 3.936e+03 8.316e+03
#> Residents 1932.0000 8405.0000 1.191e+04 1.174e+04 1.420e+04 3.410e+04
#> Households 779.0000 3486.0000 4.935e+03 4.793e+03 5.850e+03 1.573e+04
#> MeanAge 28.4000 35.7000 3.860e+01 3.843e+01 4.140e+01 4.730e+01
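Switching between the two orientations after the fact only requires a transpose with t(); a minimal sketch with hypothetical toy columns:

```r
# The rbind and cbind orientations are transposes of each other
# (hypothetical toy columns)
summaries <- lapply(list(a = 1:4, b = 5:8), summary)
by_column <- do.call(cbind, summaries)
by_row    <- do.call(rbind, summaries)
all(t(by_row) == by_column)
#> [1] TRUE
```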

Now that we have this numerical_summary object, we can use it to perform computations, such as finding the range of the proportion of votes in favor of leaving across wards (0.6681), which may be useful for interpreting the big differences among the types of wards we may find in the UK. If we want to know which wards produce this result, we can search for the wards with the lowest and highest proportions of votes:

numerical_summary["Max.", "Proportion"] - numerical_summary["Min.", "Proportion"]
#> [1] 0.6681

desired_variables <- c(
    "ID",
    "NoQuals",
    "Proportion",
    "AdultMeanAge",
    "L4Quals_plus",
    "RegionName"
)

data[which.max(data$Proportion), desired_variables]
#> ID NoQuals Proportion AdultMeanAge L4Quals_plus RegionName
#> 754 754 35.8 0.7897 48.7 13.7 L

data[which.min(data$Proportion), desired_variables]
#> ID NoQuals Proportion AdultMeanAge L4Quals_plus RegionName
#> 732 732 2.8 0.1216 31.2 44.3 EE

As you can see, this analysis already shows some interesting results. The UK ward that voted to leave the EU the most is characterized by older people (AdultMeanAge) with low education levels (NoQuals, L4Quals_plus). On the other hand, the UK ward that voted to remain in the EU the most is characterized by younger people with much higher education levels. Of course, this is not the full picture, but it's a hint about the direction in which we need to look to further understand what's going on. For now, we have found that education and age seem to be relevant variables for the analysis.